
HistorEx: Exploring Historical Text Corpora Using Word and Document Embeddings





Published: June 2019



Abstract
Written text offers insight into the nature of past and present cultures and societies. Numerous projects have digitized historical documents and published them in digital libraries, which researchers can use as valuable resources. However, the volume of available text exceeds what humans can explore efficiently. This paper presents a framework that combines unsupervised machine learning techniques and natural language processing, using historical text documents on the 19th-century USA as an example. Named entities are extracted from semi-structured text, which is enriched with complementary information from Wikidata. Word embeddings are leveraged to enable further analysis of the text corpus, which is visualized in a web-based application.

Keywords: Word Embeddings, Document Vectors, Wikidata, Cultural Heritage, Visualization, Recommender System
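
The following sketch illustrates one common way in which word embeddings can support corpus exploration and recommendation: a document vector is formed by averaging the word vectors of its tokens, and documents are ranked by cosine similarity. It is not taken from the paper. The averaging scheme, the toy three-dimensional vectors, the document names, and the helper functions `document_vector`, `cosine`, and `most_similar` are all assumptions made for illustration.

```python
import math

# Toy word vectors with made-up values (assumption, for illustration only).
# A real system would load vectors trained on the corpus.
WORD_VECTORS = {
    "railroad": [0.9, 0.1, 0.0],
    "locomotive": [0.8, 0.2, 0.1],
    "senate": [0.0, 0.9, 0.2],
    "congress": [0.1, 0.8, 0.3],
}

# Hypothetical tokenized documents.
DOCS = {
    "doc_rail": ["railroad", "locomotive"],
    "doc_rail2": ["locomotive", "the"],
    "doc_gov": ["senate", "congress"],
}


def document_vector(tokens, vectors):
    """Average the vectors of all known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, docs, vectors):
    """Rank all other documents by similarity to the query document."""
    doc_vecs = {d: document_vector(toks, vectors) for d, toks in docs.items()}
    query = doc_vecs[query_id]
    if query is None:
        return []
    scored = [
        (d, cosine(query, v))
        for d, v in doc_vecs.items()
        if d != query_id and v is not None
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for doc_id, score in most_similar("doc_rail", DOCS, WORD_VECTORS):
        print(doc_id, round(score, 3))
```

Averaging is only one of several ways to derive document vectors; it is used here because it needs nothing beyond the word vectors themselves.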



Research Group

Information Service Engineering


Research Area