Bilingual Word Embeddings from Parallel and Non-parallel Corpora for Cross-Language Text Classification.

Published: June 2016

Book title: Human Language Technologies: The 2016 Annual Conference of the North American Chapter of the ACL.
Publisher: Association for Computational Linguistics
Organization: NAACL HLT

Peer-reviewed publication

Abstract
In many languages, the sparse availability of resources poses numerous challenges for textual analysis tasks. Text classification is one such standard task, hindered by the limited availability of label information in low-resource languages. Transferring knowledge (i.e., label information) from high-resource to low-resource languages might improve text classification compared with other approaches such as machine translation. We introduce BRAVE (Bilingual paRAgraph VEctors), a model that learns bilingual distributed representations (i.e., embeddings) of words without word alignments, either from sentence-aligned parallel corpora or from label-aligned non-parallel document corpora, to support cross-language text classification. Empirical analysis shows that classification models trained with our bilingual embeddings outperform other state-of-the-art systems on three different cross-language text classification tasks.

Download: NAACL-HLT-2016-Camera-Ready.pdf

Project

XLiMe

Research group

Web Science and Knowledge Management

Research area
