Published: December 2013
Institution: Institut AIFB, KIT
Place of publication: Karlsruhe
Instance matching is an important step in data integration, where the goal is to find instance representations that refer to the same entity. In this paper, we propose an efficient approach to learning the attributes, similarity functions, and thresholds, together called instance-matching rules, used to find matches. Existing rule-based approaches calculate the similarity of each attribute separately and identify an instance pair as a match only if each of the similarities is high enough. They may therefore fail to identify a matching instance pair if errors occur in a single attribute. In addition, these approaches cannot learn the rules effectively without fine-tuning of parameters. They are also expensive in learning, because they learn the best rule from a large set of candidates whose size depends on the number of attributes, similarity functions, and especially training examples. In this paper, we address these three problems. We compare two instances as a whole by calculating the average similarity over a set of attributes, which balances out errors in any single attribute. The proposed approach is nearly parameter-free: the parameter values can be estimated easily from the training data and require no fine-tuning. We then propose an efficient algorithm that learns the instance-matching rules from a significantly smaller set of candidates whose size depends only on the number of attributes and similarity functions. Experiments on both real and synthetic datasets show that our solution greatly improves effectiveness as well as efficiency, reducing learning time by up to 87%. Moreover, the approach is robust: it achieves stable results when the parameters are set to values from a wide range.
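The difference between per-attribute matching and matching on the average similarity can be illustrated with a minimal sketch. It is not the algorithm from the paper: the attribute names, the records, the string similarity (Python's `difflib.SequenceMatcher`), and the hand-picked threshold of 0.9 are all assumptions made for illustration, whereas the paper learns attributes, similarity functions, and thresholds from training data.

```python
from difflib import SequenceMatcher


def sim(a: str, b: str) -> float:
    """Illustrative string similarity in [0, 1] (an assumption, not the paper's function)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def per_attribute_match(x: dict, y: dict, attributes: list, threshold: float) -> bool:
    """Match only if every attribute similarity reaches the threshold."""
    return all(sim(x[a], y[a]) >= threshold for a in attributes)


def average_match(x: dict, y: dict, attributes: list, threshold: float) -> bool:
    """Match if the average similarity over the attribute set reaches the threshold."""
    avg = sum(sim(x[a], y[a]) for a in attributes) / len(attributes)
    return avg >= threshold


# Hypothetical records: the same entity, with a spelling variant in one attribute.
attributes = ["name", "city", "year"]
x = {"name": "Anna Schmidt", "city": "Karlsruhe", "year": "1980"}
y = {"name": "Anna Schmidt", "city": "Carlsruhe", "year": "1980"}
# Hypothetical record of a different entity.
z = {"name": "Peter Mueller", "city": "Berlin", "year": "1975"}

threshold = 0.9  # hand-picked for the example
strict_xy = per_attribute_match(x, y, attributes, threshold)
averaged_xy = average_match(x, y, attributes, threshold)
averaged_xz = average_match(x, z, attributes, threshold)
```

In this sketch the single deviating attribute (`city`) falls just below the threshold, so the per-attribute rule rejects the pair `x`, `y`, while the averaged rule accepts it because the other two attributes agree exactly. The clearly different record `z` is still rejected by the averaged rule.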