Published: 1999 November
Book title: 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX'99), Chicago
The diversity and availability of information sources on the World Wide Web have set the stage for integration and reuse at an unparalleled scale. There remain significant hurdles to exploiting the extent of the Web's resources in a consistent, scalable and maintainable fashion. The autonomy and volatility of Web sources complicate keeping wrappers consistent with the requirements of the data's target application. Also, the sources' semantic heterogeneity requires practical methods to mediate their contents. This paper describes our efforts in developing an algebra on semistructured data. This algebra is the tool we use to develop and maintain wrappers, and to mediate their semantic content. We describe wrapper creation, refinement and maintenance as the process of developing a congruity measure between source data sets and their target application. The congruity measure expresses explicitly the context within which the source data is relevant for its target use. Enabling mediation between wrappers corresponds to establishing an articulation between data sources through a similarity measure. The similarity measure encapsulates the conditions under which distinct sources may be used together. Examples using the algebra show how the Summarize operator enables creation and maintenance of a target data set from a source. We define the Match operator and demonstrate its use in mediating the content of similar sources. We have applied these two operators, respectively, to an online dictionary with over 125,000 different terms and to the web pages of NATO governments and their allies. We give a description of the rule language that we use to wrap and mediate sources, and conclude with the applications that motivate the other operations in the algebra.