Published: May 2013
Institution: Institute AIFB, KIT
Place of publication: Karlsruhe
The Resource Description Framework (RDF) has become an accepted standard for describing entities on the Web. At the same time, many RDF descriptions today are text-rich: besides structured data, they also feature large portions of unstructured text. Such semi-structured data is frequently queried with hybrid queries, which combine predicates that match structured data with string predicates that express textual constraints. Evaluating hybrid queries efficiently requires effective means for selectivity estimation. Previous work on selectivity estimation, however, targets either structured or unstructured data alone. In contrast, we study the problem in a uniform manner by exploiting a topic model as a data synopsis, which enables us to accurately capture correlations between structured and unstructured data. Relying on this synopsis, our novel topic-based approach (TopGuess) uses a small, fine-grained, query-specific Bayesian network (BN). In experiments on real-world data, we show that the query-specific BN allows for great improvements in estimation accuracy. Compared to a baseline relying on PRMs, we achieved a gain of 20%. In terms of efficiency, TopGuess performed comparably to our baselines.
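To make the notion of hybrid-query selectivity concrete, the following minimal Python sketch computes the exact selectivity of a hybrid query over a toy data set and compares it with a naive estimate that treats the structured and textual parts as independent. It is not the TopGuess method and uses neither a topic model nor a Bayesian network. The data, the attribute names (`type`, `year`, `text`) and all function names are assumptions made up for illustration.

```python
# Toy entities: structured attributes plus a free-text field.
# All values and names are hypothetical and for illustration only.
entities = [
    {"type": "Movie", "year": 1999, "text": "a hacker discovers reality is simulated"},
    {"type": "Movie", "year": 2010, "text": "a thief enters dreams to plant an idea"},
    {"type": "Book", "year": 1999, "text": "a novel about cyberspace"},
    {"type": "Movie", "year": 1999, "text": "a romantic comedy set in london"},
]


def matches(entity, structured, keywords):
    """Return True if the entity meets every structured constraint
    and its text contains every keyword."""
    if any(entity.get(key) != value for key, value in structured.items()):
        return False
    words = set(entity["text"].split())
    return all(keyword in words for keyword in keywords)


def exact_selectivity(entities, structured, keywords):
    """Fraction of entities that satisfy the hybrid query (full scan)."""
    hits = sum(matches(e, structured, keywords) for e in entities)
    return hits / len(entities)


def independence_estimate(entities, structured, keywords):
    """Naive estimate: multiply the selectivities of the structured part
    and the textual part as if they were uncorrelated."""
    structured_only = exact_selectivity(entities, structured, [])
    text_only = exact_selectivity(entities, {}, keywords)
    return structured_only * text_only


# Hybrid query: movies from 1999 whose text mentions "hacker".
query_structured = {"type": "Movie", "year": 1999}
query_keywords = ["hacker"]

exact = exact_selectivity(entities, query_structured, query_keywords)
naive = independence_estimate(entities, query_structured, query_keywords)
print(exact, naive)
```

In this toy data set, the keyword occurs only in entities that also satisfy the structured constraints, so the independence assumption underestimates the true selectivity. Correlations of this kind between structured and unstructured data are what the abstract says the topic-model synopsis is meant to capture.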