Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models

Cited by: 6
Authors
Lee, Chanhee [1 ]
Yang, Kisu [1 ]
Whang, Taesun [2 ]
Park, Chanjun [1 ]
Matteson, Andrew
Lim, Heuiseok [1 ]
Affiliations
[1] Korea Univ, Dept Comp Sci & Engn, 145 Anam Ro, Seoul 02841, South Korea
[2] Wisenut Inc, 49 Daewangpangyo Ro 644 Beon Gil, Seongnam Si 13493, Gyeonggi Do, South Korea
Source
APPLIED SCIENCES-BASEL | 2021, Vol. 11, Issue 5
Keywords
cross-lingual; pretraining; language model; transfer learning; deep learning; RoBERTa
DOI
10.3390/app11051974
Chinese Library Classification
O6 [Chemistry]
Subject Classification Code
0703
Abstract
Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Although language modeling is unsupervised, and collecting data for it is therefore relatively inexpensive, it remains challenging for languages with limited resources. This results in a great technological disparity between high- and low-resource languages across numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data-efficient training of pretrained language models. We achieve this by formulating language modeling of low-resource languages as a domain adaptation task using Transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at the sequence level. To evaluate our method, we post-train a RoBERTa model pretrained on English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves data efficiency by a factor of up to 32 compared to monolingual training.
Pages: 1-15
Page count: 15
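As a rough illustration of the post-training recipe summarized in the abstract, the following PyTorch sketch (assuming the Hugging Face Transformers library) reuses the Transformer body of an English RoBERTa, re-initializes the language-specific embedding parameters for a target-language vocabulary, and wraps the reused encoder with adapter-style layers standing in for the paper's implicit translation layers. The class names, layer placement, and initialization details are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
from transformers import RobertaModel


class ImplicitTranslationLayer(nn.Module):
    # Hypothetical adapter-style layer standing in for the paper's implicit
    # translation layers: a residual projection intended to absorb sequence-level
    # differences between the source and target languages.
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states):
        return self.norm(hidden_states + self.proj(hidden_states))


class CrossLingualPostTrainingLM(nn.Module):
    def __init__(self, target_vocab_size, source_model="roberta-base"):
        super().__init__()
        # Reuse the Transformer body pretrained on the high-resource language (English).
        self.backbone = RobertaModel.from_pretrained(source_model)
        hidden = self.backbone.config.hidden_size
        # Language-specific parameters: a freshly initialized embedding matrix sized
        # for the target-language (e.g., Korean) vocabulary, learned during post-training.
        self.backbone.resize_token_embeddings(target_vocab_size)
        nn.init.normal_(self.backbone.embeddings.word_embeddings.weight, std=0.02)
        # Illustrative placement: one implicit translation layer after the embeddings
        # and one before the masked-language-modeling head.
        self.input_itl = ImplicitTranslationLayer(hidden)
        self.output_itl = ImplicitTranslationLayer(hidden)
        self.lm_head = nn.Linear(hidden, target_vocab_size)

    def forward(self, input_ids, attention_mask):
        embeddings = self.backbone.embeddings(input_ids=input_ids)
        hidden_states = self.input_itl(embeddings)
        # Expand the padding mask to the additive format expected by the encoder.
        extended_mask = attention_mask[:, None, None, :].to(dtype=embeddings.dtype)
        extended_mask = (1.0 - extended_mask) * torch.finfo(embeddings.dtype).min
        encoded = self.backbone.encoder(
            hidden_states, attention_mask=extended_mask
        ).last_hidden_state
        # Token-level logits over the target vocabulary for the masked-language-modeling
        # objective used during post-training on the low-resource corpus.
        return self.lm_head(self.output_itl(encoded))


# Example: instantiate the model for a hypothetical 32,000-piece Korean vocabulary.
model = CrossLingualPostTrainingLM(target_vocab_size=32000)

A plausible training schedule, consistent with the selective reuse described in the abstract, is to update only the new embeddings and translation layers first and unfreeze the reused body afterwards; the paper itself specifies the exact procedure and hyperparameters.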