
Inproceedings3867: Difference between revisions

(Page created: "{{Publikation Erster Autor |ErsterAutorNachname=Krause |ErsterAutorVorname=Johan }} {{Publikation Author |Rank=2 |Author=Igor Shapiro }} {{Publikation Author |…")
 
 
Line 27:
{{Publikation Details
|Abstract=Applications based on scholarly data are of ever-increasing importance. This results in disadvantages for areas where high-quality data and compatible systems are not available, such as non-English publications. To advance the mitigation of this imbalance, we use Cyrillic script publications from the CORE collection to create a high-quality data set for metadata extraction. We utilize our data for training and evaluating sequence labeling models to extract title and author information. Retraining GROBID on our data, we observe significant improvements in terms of precision and recall and achieve even better results with a self-developed model. We make our data set covering over 15,000 publications as well as our source code freely available.
- |Download=19_Paper.pdf
+ |Download=Cyrillic_SDP2021.pdf
|Forschungsgruppe=Web Science
}}

Current revision as of 26 April 2021, 13:25


Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic





Published: 2021

Book title: Proceedings of the Second Workshop on Scholarly Document Processing
Publisher: CEUR-WS

Refereed publication



Abstract
Applications based on scholarly data are of ever-increasing importance. This results in disadvantages for areas where high-quality data and compatible systems are not available, such as non-English publications. To advance the mitigation of this imbalance, we use Cyrillic script publications from the CORE collection to create a high-quality data set for metadata extraction. We utilize our data for training and evaluating sequence labeling models to extract title and author information. Retraining GROBID on our data, we observe significant improvements in terms of precision and recall and achieve even better results with a self-developed model. We make our data set covering over 15,000 publications as well as our source code freely available.

Download: Media:Cyrillic_SDP2021.pdf


Linked datasets

Cyrillic Script Publication Metadata Extraction


Research group

Web Science


Research area