Searching the Web for Cross-lingual Parallel Data

Cited by: 4
Authors
El-Kishky, Ahmed [1]
Koehn, Philipp [2]
Schwenk, Holger [1]
Affiliations
[1] Facebook AI, Seattle, WA 98109 USA
[2] Johns Hopkins Univ, Baltimore, MD USA
Source
PROCEEDINGS OF THE 43RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '20) | 2020
Keywords
cross-lingual document retrieval; cross-lingual sentence retrieval; machine translation; multilingual embedding; web mining
DOI
10.1145/3397271.3401417
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
While the World Wide Web provides a large amount of text in many languages, cross-lingual parallel data is more difficult to obtain. Despite its scarcity, this parallel cross-lingual data plays a crucial role in a variety of tasks in natural language processing with applications in machine translation, cross-lingual information retrieval, and document classification, as well as learning cross-lingual representations. Here, we describe the end-to-end process of searching the web for parallel cross-lingual texts. We motivate obtaining parallel text as a retrieval problem whereby the goal is to retrieve cross-lingual parallel text from a large, multilingual web-crawled corpus. We introduce techniques for searching for cross-lingual parallel data based on language, content, and other metadata. We motivate and introduce multilingual sentence embeddings as a core tool and demonstrate techniques and models that leverage them for identifying parallel documents and sentences as well as techniques for retrieving and filtering this data. We describe several large-scale datasets curated using these techniques and show how training on sentences extracted from parallel or comparable documents mined from the Web can improve machine translation models and facilitate cross-lingual NLP.
Pages: 2417-2420
Page count: 4