Date: July 22, 2011
KIT, Faculty of Economics (Fakultät für Wirtschaftswissenschaften)
Place of publication: Karlsruhe
Referee(s): Rudi Studer, Philipp Cimiano
Information Retrieval (IR) deals with delivering relevant information items given the specific information needs of users. As retrieval problems arise in environments as varied as the World Wide Web, corporate knowledge bases and even personal desktops, IR is an everyday problem that concerns almost everybody in our society. In this thesis, we present research results on the problem of Multilingual IR (MLIR), which covers retrieval scenarios that cross language borders. MLIR is a real-world problem, which we motivate using different application scenarios, for example search systems whose users have reading skills in several languages, or expert retrieval.
As the main topic of this thesis, we consider how user-generated content that is assembled by different popular Web portals can be exploited for MLIR. These portals, of which Wikipedia and Yahoo! Answers are prominent examples, are built from the contributions of millions of users. We define the knowledge that can be derived from such portals as Social Semantics. Further, we identify important features of Social Semantics, namely the support of multiple languages, the broad coverage of topics and the ability to adapt to new topics. Based on these features, we argue that Social Semantics can be exploited as background knowledge to support multilingual retrieval systems.
Our main contribution is the integration of Social Semantics into multilingual retrieval models. Specifically, we present Cross-lingual Explicit Semantic Analysis, a semantic document representation that is based on interlingual concepts derived from Wikipedia. Further, we propose a mixture language model that integrates different sources of evidence, including the knowledge encoded in the category structure of Yahoo! Answers.
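To make the idea of a concept-based, language-independent document representation more concrete, the following is a minimal sketch and not the implementation from the thesis. It assumes a tiny hand-made stand-in for Wikipedia (the `CONCEPTS` dictionary, with one short article per concept and language) and uses raw term overlap as the association score, which is a simplification chosen for brevity. All names (`CONCEPTS`, `concept_vector`, `cosine`, `rank`) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy stand-in for Wikipedia: every interlingual concept has
# one article text per language, aligned as cross-language links would do.
CONCEPTS = {
    "Music": {
        "en": "music song band guitar concert",
        "de": "musik lied band gitarre konzert",
    },
    "Football": {
        "en": "football goal team match player",
        "de": "fussball tor mannschaft spiel spieler",
    },
    "Computer": {
        "en": "computer software program data",
        "de": "computer software programm daten",
    },
}


def concept_vector(text, lang):
    """Map a text to a vector over interlingual concepts.

    The association with a concept is the raw term overlap between the text
    and that concept's article in the given language (an assumption made
    here for simplicity).
    """
    tokens = Counter(text.lower().split())
    vector = {}
    for concept, articles in CONCEPTS.items():
        article_terms = Counter(articles[lang].split())
        vector[concept] = float(
            sum(count * article_terms[term] for term, count in tokens.items())
        )
    return vector


def cosine(u, v):
    """Cosine similarity of two concept vectors with identical keys."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, query_lang, documents, doc_lang):
    """Rank documents in one language for a query in another language."""
    q = concept_vector(query, query_lang)
    scored = [(cosine(q, concept_vector(d, doc_lang)), d) for d in documents]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    german_docs = [
        "konzert mit gitarre und band",
        "die mannschaft gewinnt das spiel",
    ]
    for score, doc in rank("guitar concert", "en", german_docs, "de"):
        print(f"{score:.2f}  {doc}")
```

In this sketch the English query and the German documents never share a vocabulary. They become comparable only because both are projected onto the same set of concepts, each through the articles in its own language.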
For evaluation, we measure the benefit of the proposed retrieval models that exploit Social Semantics. In our experiments, we apply these models to different established datasets, which allows for comparison with standard IR baselines and with related approaches that are based on different kinds of background knowledge. As standardized settings were not available for all the scenarios we considered, in particular for multilingual Expert Retrieval, we further organized an international retrieval challenge that allowed us to evaluate our proposed retrieval models in settings not covered by existing challenges.
Download: Dissertation Sorg Philipp.pdf
Information Retrieval, Machine Learning, Natural Language Processing