Fine-Tuning Self-Supervised Multilingual Sequence-To-Sequence Models for Extremely Low-Resource NMT

Cited by: 0
Authors
Thillainathan, Sarubi [1 ]
Ranathunga, Surangika [1 ]
Jayasena, Sanath [1 ]
Affiliations
[1] Univ Moratuwa, Dept Comp Sci & Engn, Katubedda, Sri Lanka
Source
MORATUWA ENGINEERING RESEARCH CONFERENCE (MERCON 2021) / 7TH INTERNATIONAL MULTIDISCIPLINARY ENGINEERING RESEARCH CONFERENCE | 2021
Keywords
neural machine translation; pre-trained models; fine-tuning; denoising autoencoder; low-resource languages; NEURAL MACHINE TRANSLATION; SINHALA; TAMIL;
DOI
10.1109/MERCON52712.2021.9525720
CLC Classification Number
T [Industrial Technology];
Subject Classification Code
08;
Abstract
Neural Machine Translation (NMT) tends to perform poorly in low-resource language settings due to the scarcity of parallel data. Instead of relying on inadequate parallel corpora, we can take advantage of monolingual data, which is available in abundance. One way to utilize monolingual data is to train a denoising self-supervised multilingual sequence-to-sequence model on noised versions of large-scale monolingual corpora. If both languages of a pair are covered by such a pre-trained multilingual denoising model, the model can then be fine-tuned with a smaller amount of parallel data for that language pair. This paper presents fine-tuning of self-supervised multilingual sequence-to-sequence pre-trained models for extremely low-resource, domain-specific NMT settings. We choose one such pre-trained model: mBART. We are the first to implement and demonstrate the viability of non-English-centric complete fine-tuning of multilingual sequence-to-sequence pre-trained models. We select the Sinhala, Tamil and English languages to demonstrate fine-tuning in extremely low-resource settings in the domain of official government documents. Experiments show that our fine-tuned mBART model significantly outperforms state-of-the-art Transformer-based NMT models in all six bilingual directions across the three language pairs, with a 4.41 BLEU score increase on Tamil -> Sinhala and a 2.85 BLEU increase on Sinhala -> Tamil translation.
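As a rough illustration of the fine-tuning recipe the abstract describes, the sketch below fine-tunes a pre-trained mBART checkpoint on a tiny Tamil -> Sinhala parallel set using the Hugging Face Transformers API. The checkpoint name (facebook/mbart-large-50), the ta_IN/si_LK language codes, the placeholder corpus, and the hyperparameters are illustrative assumptions; the paper does not specify this toolkit or configuration.

```python
# Minimal sketch (not the authors' setup): fine-tune a multilingual denoising
# pre-trained seq2seq model (mBART) on a small Tamil -> Sinhala parallel corpus.
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

CHECKPOINT = "facebook/mbart-large-50"  # assumed checkpoint covering si_LK and ta_IN

tokenizer = MBart50TokenizerFast.from_pretrained(
    CHECKPOINT, src_lang="ta_IN", tgt_lang="si_LK"
)
model = MBartForConditionalGeneration.from_pretrained(CHECKPOINT)

# Placeholder parallel data; a real run would load the domain-specific corpus.
raw = Dataset.from_dict({
    "ta": ["<Tamil source sentence 1>", "<Tamil source sentence 2>"],
    "si": ["<Sinhala target sentence 1>", "<Sinhala target sentence 2>"],
})

def preprocess(batch):
    # Tokenize sources and targets; text_target uses the tgt_lang code set above.
    return tokenizer(batch["ta"], text_target=batch["si"],
                     truncation=True, max_length=128)

tokenized = raw.map(preprocess, batched=True, remove_columns=["ta", "si"])

args = Seq2SeqTrainingArguments(
    output_dir="mbart-ta-si",          # hypothetical output directory
    per_device_train_batch_size=8,
    learning_rate=3e-5,                # assumed value, not taken from the paper
    num_train_epochs=5,
    save_total_limit=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Translate a held-out Tamil sentence with the fine-tuned model.
inputs = tokenizer("<Tamil source sentence>", return_tensors="pt").to(model.device)
generated = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["si_LK"]
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

The same pattern applies to any of the six bilingual directions by swapping the src_lang/tgt_lang codes and the forced beginning-of-sentence token.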
Pages: 432-437
Page count: 6