Supervisors: Harald Sack, Fabian Hoppe
Research group: Information Service Engineering
Partner: FIZ Karlsruhe
Start: 1 May 2020
Submission: 12 December 2020
Text classification is one of the most effective ways to organize rapidly growing textual data into fine-grained classes, which in turn enables the retrieval and exploration of information. Recently, several supervised learning approaches have achieved promising results for text classification by utilizing pretrained language models and deep learning algorithms. Most of these approaches consider only a small number of classes. However, many real-world tasks involve a large number of classes; for example, the English Wikipedia category system, which is used to organize articles, consists of more than 100,000 categories. This poses additional challenges for the classification task, such as complex class relationships and class imbalance, which have to be taken into account. This thesis investigates these additional challenges of fine-grained text classification in order to provide an overview of state-of-the-art multi-label text classification approaches and to support current research on fine-grained text classification. As a first step, a set of relevant state-of-the-art feature extraction models will be identified through a literature review. A common deep learning classifier (CNN, LSTM, etc.) will then be trained on fine-grained classification datasets using the identified features; the Wikipedia category dataset will be used for this purpose. Finally, each feature extraction model will be evaluated with varying numbers of classes to provide insights into the influence of class granularity on these features.
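The final evaluation step, scoring the same predictions at different class granularities, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy category hierarchy `PARENT`, the gold and predicted label sets, and the choice of micro-averaged F1 as the metric (the announcement does not name one). It is not the evaluation protocol of the thesis.

```python
# Hypothetical toy hierarchy: fine-grained category -> parent category.
# These names are invented and not taken from the Wikipedia category dataset.
PARENT = {
    "Convolutional networks": "Machine learning",
    "Recurrent networks": "Machine learning",
    "Baroque composers": "Music",
    "Jazz pianists": "Music",
}


def coarsen(labels, parent=PARENT):
    """Map a set of fine-grained labels to the set of their parent categories."""
    return {parent[label] for label in labels}


def micro_f1(gold, predicted):
    """Micro-averaged F1 over parallel lists of gold and predicted label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy multi-label data (invented): one label set per document.
gold = [
    {"Convolutional networks"},
    {"Recurrent networks", "Jazz pianists"},
    {"Baroque composers"},
]
predicted = [
    {"Recurrent networks"},                    # wrong leaf, correct parent
    {"Recurrent networks", "Jazz pianists"},   # fully correct
    {"Jazz pianists"},                         # wrong leaf, correct parent
]

# Score the same predictions at two granularities: 4 leaf classes vs. 2 parents.
fine_f1 = micro_f1(gold, predicted)
coarse_f1 = micro_f1([coarsen(g) for g in gold], [coarsen(p) for p in predicted])

print(f"fine-grained (4 classes): micro-F1 = {fine_f1:.2f}")
print(f"coarse (2 classes):       micro-F1 = {coarse_f1:.2f}")
```

In this toy case the predictions that miss the exact leaf category still land under the correct parent, so the score rises as the label set gets coarser. Comparing such scores across feature extraction models is one way the influence of class granularity could be made visible.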
Announcement: Download (pdf)