LAPCA: Language-Agnostic Pretraining with Cross-Lingual Alignment

Cited by: 3
Authors
Abulkhanov, Dmitry [1]
Sorokin, Nikita [1]
Nikolenko, Sergey [2,3]
Malykh, Valentin [1]
Affiliations
[1] Huawei Noah's Ark Lab, Moscow, Russia
[2] RAS, Ivannikov Inst Syst Programming, Moscow, Russia
[3] RAS, Steklov Inst Math, St Petersburg Dept, St Petersburg, Russia
Source
PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023 | 2023
Keywords
cross-lingual IR; multilingual IR; Transformer-based architectures
DOI
10.1145/3539618.3592006
CLC classification
TP [Automation and Computer Technology]
Discipline code
0812
Abstract
Data collection and mining are a crucial bottleneck for cross-lingual information retrieval (CLIR). While previous work has relied on machine translation and iterative training, we present a novel approach to cross-lingual pretraining called LAPCA (language-agnostic pretraining with cross-lingual alignment). We train LAPCA-LM, a model based on XLM-RoBERTa and LAPCA, which significantly improves cross-lingual knowledge transfer for question answering and sentence retrieval on, e.g., the XOR-TyDi and Mr. TyDi datasets; in the zero-shot cross-lingual scenario it performs on par with supervised methods, outperforming many of them on MKQA.
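This record does not spell out LAPCA's training objective. Purely as a generic illustration of what "cross-lingual alignment" pretraining typically means in retrieval models, the sketch below implements a standard InfoNCE-style contrastive loss over parallel sentence embeddings (the function name, shapes, and temperature value are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def contrastive_alignment_loss(src, tgt, temperature=0.05):
    """Illustrative InfoNCE loss for cross-lingual alignment (not the paper's
    exact objective). src and tgt are (batch, dim) sentence embeddings where
    row i of src is a translation pair of row i of tgt."""
    # L2-normalize so dot products become cosine similarities
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    # (batch, batch) similarity matrix; diagonal entries are the true pairs
    logits = src @ tgt.T / temperature
    # log-softmax over each row, with numerical stabilization
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy: each source sentence should rank its own translation first
    return -np.mean(np.diag(log_probs))
```

Under such an objective, correctly paired batches yield a low loss while mismatched pairings yield a high one, which is what pushes translations of the same sentence toward nearby points in the shared embedding space.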
Pages: 2098-2102
Page count: 5