LAPCA: Language-Agnostic Pretraining with Cross-Lingual Alignment

Cited by: 1
Authors
Abulkhanov, Dmitry [1]
Sorokin, Nikita [1]
Nikolenko, Sergey [2, 3]
Malykh, Valentin [1]
Affiliations
[1] Huawei Noah's Ark Lab, Moscow, Russia
[2] RAS, Ivannikov Inst Syst Programming, Moscow, Russia
[3] RAS, Steklov Inst Math, St Petersburg Dept, St Petersburg, Russia
Source
PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023 | 2023
Keywords
cross-lingual IR; multilingual IR; Transformer-based architectures
DOI
10.1145/3539618.3592006
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Data collection and mining is a crucial bottleneck for cross-lingual information retrieval (CLIR). While previous works relied on machine translation and iterative training, we present a novel approach to cross-lingual pretraining called LAPCA (language-agnostic pretraining with cross-lingual alignment). We train the LAPCA-LM model, based on XLM-RoBERTa and LAPCA, which significantly improves cross-lingual knowledge transfer for question answering and sentence retrieval on datasets such as XOR-TyDi and Mr. TyDi. In the zero-shot cross-lingual scenario, it performs on par with supervised methods and outperforms many of them on MKQA.
Pages: 2098-2102
Page count: 5