Large-Scale Pattern-Based Information Extraction from the World Wide Web

Date: January 22, 2010
KIT, Faculty of Economics (Fakultät für Wirtschaftswissenschaften)
Place of publication: Karlsruhe
Referee(s): Prof. Dr. Rudi Studer

Extracting information from text is the task of obtaining structured, machine-processable facts from information that is expressed in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. Information Extraction systems require a model that describes how to identify relevant target information in texts. Such models need to be adapted both to the exact nature of the target information and to the nature of the textual input. This is typically accomplished by means of Machine Learning techniques that generate the models from examples. Textual patterns are one particular type of Information Extraction model. They are underspecified, explicit descriptions of text fragments. A common way to learn this type of extraction model is to automatically induce such patterns from example text fragments that are known to contain target information.
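As an illustration only, the idea of inducing a textual pattern from an example fragment and applying it to new text can be sketched as follows. This is a minimal sketch under simplifying assumptions, not the method developed in the thesis: the function names `induce_pattern` and `extract`, the slot names `x` and `y`, and the restriction of arguments to single capitalized words are all hypothetical choices made for the example.

```python
import re

# Assumption for this sketch: each argument is a single capitalized word.
SLOT_X = r"(?P<x>[A-Z][a-z]+)"
SLOT_Y = r"(?P<y>[A-Z][a-z]+)"


def induce_pattern(fragment, arg_x, arg_y):
    """Turn an example fragment into an underspecified pattern.

    The two known arguments are replaced by wildcard slots, while the
    remaining context is kept as literal text.
    """
    pattern = re.escape(fragment)
    pattern = pattern.replace(re.escape(arg_x), SLOT_X, 1)
    pattern = pattern.replace(re.escape(arg_y), SLOT_Y, 1)
    return pattern


def extract(pattern, text):
    """Return all (x, y) pairs that the pattern matches in the text."""
    return [(m.group("x"), m.group("y")) for m in re.finditer(pattern, text)]


# One example fragment known to contain a target fact (Paris, France).
pattern = induce_pattern("Paris is the capital of France", "Paris", "France")

# Apply the induced pattern to previously unseen text.
text = "Berlin is the capital of Germany, and Rome is the capital of Italy."
print(extract(pattern, text))
```

The pattern is underspecified in the sense that the argument positions are left open while the surrounding words are fixed. A realistic system would induce many such patterns from many examples and would need to decide which of them are reliable enough to keep.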

This thesis explores the potential of using textual patterns for Information Extraction from the World Wide Web. We review and discuss a large body of related work by describing it within a common framework. Then, we empirically analyze the effects of a multitude of design choices in pattern-based Information Extraction systems. In particular, we investigate how patterns can be filtered appropriately. We show how corpora of different kinds can be exploited to advantage and how the nature of the patterns influences extraction quality. Finally, we present new ways of mining textual patterns by modelling pattern induction as a well-understood type of Data Mining problem.

Information Extraction, Text Mining, Data Mining, Machine Learning