Supervisors: Harald Sack, Fabian Müller
Research group: Information Service Engineering
Start: 1 May 2020
The Mathematics Subject Classification (MSC) is a three-level hierarchical classification scheme for mathematics and neighbouring areas that is jointly developed and continuously refined by the two large information providers for mathematical publications, zbMATH and MathSciNet. At zbMATH, a product developed by FIZ Karlsruhe's Department of Mathematics in Berlin, the MSC is used to structure the internal workflow and to support interlinking, filtering, and browsing in the user interface.
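To make the three-level hierarchy concrete: a fully specified MSC code such as 68T50 consists of a two-digit top-level area (68), a letter that selects the second level (68T), and two further digits for the third level. The following is a minimal sketch, assuming codes are given in this five-character form; the function name `msc_levels` is hypothetical, and special forms such as hyphenated codes are deliberately not handled.

```python
import re

# Five-character MSC codes: two digits, one letter, two digits (e.g. "68T50").
# Assumption: only fully specified third-level codes are handled here.
_MSC_RE = re.compile(r"^(\d{2})([A-Z])(\d{2})$")


def msc_levels(code):
    """Split a third-level MSC code into its (top, second, third) levels."""
    match = _MSC_RE.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a third-level MSC code: {code!r}")
    top, letter, _ = match.groups()
    return top, top + letter, match.group(0)
```

A classifier can use such a split to predict the coarse area first and refine it afterwards, or to give partial credit when only the finest level is wrong.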
Automated classification of text documents according to a fixed classification scheme is a classic task in Natural Language Processing. The MSC use case is unique in two respects. The first is the use of information and formats specific to mathematics, in particular formulae encoded in LaTeX. The second is the presence of additional structured metadata providing semantic context, namely information about authors, their publication history and coauthor networks, and about journals and their area-specific orientation. Training data is available in the form of about 3.4 million hand-annotated documents.
Your task in this thesis will be to develop a supervised learning algorithm that classifies mathematical text according to the MSC, using suitable NLP tools that can take the additional information into account. The algorithm should be able to adapt both to additional training data and to updates of the underlying classification scheme (the next update, driven by community feedback, is scheduled for 2020). The task is expected to require mastery of standard techniques as well as the development of new approaches specific to the use case. In addition to requiring scientific work suitable for a master's thesis, the finished algorithm is planned to be put into production at zbMATH after the end of the project.
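As a rough illustration of the requirements above, the sketch below combines text tokens, LaTeX commands, and metadata into one bag of features and feeds them to a count-based multinomial naive Bayes model. Naive Bayes is chosen here only because it is simple to update; it is not the method the thesis prescribes. All names (`features`, `IncrementalNaiveBayes`, the `author=` and `journal=` prefixes) are hypothetical, and the example inputs in the usage note are made up.

```python
import math
import re
from collections import Counter, defaultdict

# Tokens are either LaTeX commands (e.g. "\zeta") or plain words.
_TOKEN_RE = re.compile(r"\\[A-Za-z]+|[A-Za-z]+")


def features(text, authors=(), journal=None):
    """Turn text plus metadata into one bag of string features."""
    bag = Counter(tok if tok.startswith("\\") else tok.lower()
                  for tok in _TOKEN_RE.findall(text))
    # Metadata gets a prefix so it cannot collide with ordinary words.
    bag.update(f"author={a}" for a in authors)
    if journal is not None:
        bag[f"journal={journal}"] += 1
    return bag


class IncrementalNaiveBayes:
    """Multinomial naive Bayes that stores only counts, so new documents
    and new class labels can be added without retraining from scratch."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing
        self.doc_counts = Counter()               # label -> number of documents
        self.feat_counts = defaultdict(Counter)   # label -> feature -> count
        self.vocab = set()

    def update(self, bag, label):
        self.doc_counts[label] += 1
        self.feat_counts[label].update(bag)
        self.vocab.update(bag)

    def predict(self, bag):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat, n in bag.items():
                score += n * math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A call such as `clf.update(features(r"zeros of the $\zeta$ function", journal="J. Number Theory"), "11M06")` adds one training document; calling `update` later with a label the model has not seen before is enough to introduce a new class, which is one simple way to accommodate a revised scheme.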
Announcement: Download (pdf)