Scaling up Pattern Induction for Web Relation Extraction through Frequent Itemset Mining
Published: September 2008
Editors: Benjamin Adrian, Günter Neumann, Alexander Troussov, Borislav Popov
Book title: Proceedings of the KI 2008 Workshop on Ontology-Based Information Extraction Systems
In this paper, we address the problem of extracting relational information from the Web at a large scale. In particular, we present a bootstrapping approach to relation extraction which starts with a few seed tuples of the target relation and induces patterns that can be used to extract further tuples. Our contribution lies in formulating the pattern induction task as a well-known machine learning problem, namely that of determining frequent itemsets from a set of transactions representing patterns. Formulating the extraction problem as frequent itemset mining is not only elegant, but also speeds up the pattern induction step considerably compared with previous implementations of the bootstrapping procedure. We evaluate our approach in terms of standard measures on seven datasets of varying size and complexity. In particular, by analyzing the extraction rate (extracted tuples per unit of time), we show that our approach reduces the complexity of pattern induction from quadratic to linear in the number of occurrences to be generalized, while maintaining extraction quality at similar (or even marginally better) levels.
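To make the idea of pattern induction as frequent itemset mining more concrete, the following is a minimal Python sketch. It is an illustration only, not the authors' implementation: the abstract does not specify how occurrences are encoded or which mining algorithm is used. The sketch assumes that each occurrence of a seed tuple is a token sequence of fixed length, that each transaction is the set of (position, token) items of one occurrence, and that a simple level-wise (Apriori-style) search is acceptable. The function names, the example sentences, the `ARG1`/`ARG2` placeholders and the `min_support` value are all hypothetical.

```python
from itertools import combinations


def to_transaction(tokens):
    """Encode one occurrence as a set of (position, token) items (assumed encoding)."""
    return frozenset(enumerate(tokens))


def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets contained in at least min_support transactions."""
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while candidates:
        # Keep only the candidates that occur in enough transactions.
        level = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join frequent itemsets of the current size into candidates one item larger.
        size += 1
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
    return frequent


def to_pattern(itemset, length):
    """Render an itemset as a pattern; positions not in the itemset become wildcards."""
    slots = dict(itemset)
    return " ".join(slots.get(i, "*") for i in range(length))


# Hypothetical occurrences of seed tuples, with the arguments already replaced
# by the placeholders ARG1 and ARG2.
occurrences = [
    "ARG1 is the capital of ARG2".split(),
    "ARG1 was the capital of ARG2".split(),
    "ARG1 , the capital of ARG2".split(),
    "ARG1 lies just north of ARG2".split(),
]
transactions = [to_transaction(tokens) for tokens in occurrences]

itemsets = frequent_itemsets(transactions, min_support=3)
best = max(itemsets, key=len)  # the most specific frequent itemset
pattern = to_pattern(best, length=6)
print(pattern)  # ARG1 * the capital of ARG2
```

In this toy setting, the largest frequent itemset corresponds to the most specific pattern shared by at least three occurrences, and the token position on which the occurrences disagree becomes a wildcard. The naive search above is meant only to show the problem formulation; it makes no attempt to reproduce the efficiency reported in the paper.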