LAPCA: Language-Agnostic Pretraining with Cross-Lingual Alignment

Cited by: 1
Authors
Abulkhanov, Dmitry [1 ]
Sorokin, Nikita [1 ]
Nikolenko, Sergey [2 ,3 ]
Malykh, Valentin [1 ]
Affiliations
[1] Huawei Noah's Ark Lab, Moscow, Russia
[2] RAS, Ivannikov Inst Syst Programming, Moscow, Russia
[3] RAS, Steklov Inst Math, St Petersburg Dept, St Petersburg, Russia
Source
PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023 | 2023
Keywords
cross-lingual IR; multilingual IR; Transformer-based architectures
DOI
10.1145/3539618.3592006
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Data collection and mining is a crucial bottleneck for cross-lingual information retrieval (CLIR). While previous works have relied on machine translation and iterative training, we present a novel approach to cross-lingual pretraining called LAPCA (language-agnostic pretraining with cross-lingual alignment). We train the LAPCA-LM model based on XLM-RoBERTa and LAPCA, which significantly improves cross-lingual knowledge transfer for question answering and sentence retrieval on datasets such as XOR-TyDi and Mr. TyDi; in the zero-shot cross-lingual scenario it performs on par with supervised methods, outperforming many of them on MKQA.
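The abstract does not spell out the pretraining objective, but a common way to implement cross-lingual alignment on top of XLM-RoBERTa is a contrastive (InfoNCE) loss over parallel sentence pairs with in-batch negatives. The sketch below is a minimal illustration under that assumption; the mean pooling, the temperature value, and the toy English-Russian pairs are illustrative choices, not the paper's actual LAPCA setup.

    import torch
    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    # Hypothetical sketch of contrastive cross-lingual alignment with
    # in-batch negatives (InfoNCE). LAPCA's actual objective, pooling,
    # and data pipeline may differ from this illustration.
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModel.from_pretrained("xlm-roberta-base")

    def embed(texts):
        """Mean-pool token embeddings over the attention mask."""
        batch = tokenizer(texts, padding=True, truncation=True,
                          return_tensors="pt")
        hidden = model(**batch).last_hidden_state            # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float() # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)        # (B, H)
        return F.normalize(pooled, dim=-1)

    # Toy parallel pairs (English <-> Russian); row i of src translates row i of tgt.
    src = ["A cat sits on the mat.", "The weather is nice today."]
    tgt = ["Кошка сидит на коврике.", "Сегодня хорошая погода."]

    src_emb, tgt_emb = embed(src), embed(tgt)
    logits = src_emb @ tgt_emb.T / 0.05      # temperature-scaled cosine similarities
    labels = torch.arange(len(src))          # diagonal entries are the true pairs
    loss = F.cross_entropy(logits, labels)   # pull translations together, push others apart
    loss.backward()                          # one alignment training step

Training on many such pairs encourages the encoder to map a sentence and its translation to nearby points in the shared embedding space, which is what makes zero-shot cross-lingual retrieval of the kind evaluated on XOR-TyDi and Mr. TyDi possible.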
Pages: 2098-2102
Page count: 5