Searching the Web for Cross-lingual Parallel Data

Cited by: 4

Authors:

El-Kishky, Ahmed [1]

Koehn, Philipp [2]

Schwenk, Holger [1]

Affiliations:

[1] Facebook AI, Seattle, WA 98109 USA

[2] Johns Hopkins Univ, Baltimore, MD USA

Source:

PROCEEDINGS OF THE 43RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '20) | 2020

Keywords:

cross-lingual document retrieval; cross-lingual sentence retrieval; machine translation; multilingual embedding; web mining

DOI:

10.1145/3397271.3401417

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

While the World Wide Web provides a large amount of text in many languages, cross-lingual parallel data is more difficult to obtain. Despite its scarcity, this parallel cross-lingual data plays a crucial role in a variety of tasks in natural language processing with applications in machine translation, cross-lingual information retrieval, and document classification, as well as learning cross-lingual representations. Here, we describe the end-to-end process of searching the web for parallel cross-lingual texts. We motivate obtaining parallel text as a retrieval problem whereby the goal is to retrieve cross-lingual parallel text from a large, multilingual web-crawled corpus. We introduce techniques for searching for cross-lingual parallel data based on language, content, and other metadata. We motivate and introduce multilingual sentence embeddings as a core tool and demonstrate techniques and models that leverage them for identifying parallel documents and sentences as well as techniques for retrieving and filtering this data. We describe several large-scale datasets curated using these techniques and show how training on sentences extracted from parallel or comparable documents mined from the Web can improve machine translation models and facilitate cross-lingual NLP.

Pages: 2417-2420

Page count: 4
