Augmenting Training Data for Low-Resource Neural Machine Translation via Bilingual Word Embeddings and BERT Language Modelling

Cited by: 0
Authors
Ramesh, Akshai [1 ]
Uhana, Haque Usuf [2 ]
Parthasarathy, Venkatesh Balavadhani [3 ]
Haque, Rejwanul [4 ]
Way, Andy [5 ]
Affiliations
[1] Iconic Translation Machines, Invent Building, DCU Campus, Dublin 9, Ireland
[2] Government College of Engineering & Textile Technology (GCETT), Department of Computer Science & Engineering, Serampore, India
[3] KantanMT, Invent Building, DCU Campus, Dublin 9, Ireland
[4] National College of Ireland, ADAPT Centre, School of Computing, IFSC, Mayor Square, Dublin 1, Ireland
[5] Dublin City University, ADAPT Centre, School of Computing, Dublin 9, Ireland
Source
2021 International Joint Conference on Neural Networks (IJCNN) | 2021
Funding
Science Foundation Ireland
Keywords
Machine translation; Neural machine translation; Transformer; Language modelling
DOI
10.1109/IJCNN52387.2021.9534211
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Neural machine translation (NMT) is often described as 'data hungry' because it typically requires large amounts of parallel data to build a good-quality machine translation (MT) system. However, most of the world's language pairs are low-resource or extremely low-resource, and the situation is even worse when translation in a specialised domain is considered. In this paper, we present a novel data augmentation method which makes use of bilingual word embeddings (BWEs) learned from monolingual corpora and Bidirectional Encoder Representations from Transformers (BERT) language models (LMs). We augment a parallel training corpus by introducing new words (i.e. out-of-vocabulary (OOV) items) and by increasing the presence of rare words on both sides of the original parallel training corpus. Our experiments on simulated low-resource German-English and French-English translation tasks show that the proposed data augmentation strategy can significantly improve state-of-the-art NMT systems and outperform the state-of-the-art data augmentation approach for low-resource NMT.
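For illustration, the following is a minimal Python sketch of the kind of augmentation loop the abstract describes: a BERT masked LM proposes in-context replacement words on the source side, and a BWE-induced lexicon supplies the corresponding target-side word. The lexicon bwe_lexicon, the helper augment_pair, and the choice of the bert-base-german-cased model are all illustrative assumptions, not the authors' implementation.

# A minimal sketch of BWE + BERT-LM data augmentation, assuming the
# Hugging Face 'transformers' library (pip install transformers).
from transformers import pipeline

# Hypothetical lexicon induced from bilingual word embeddings (BWEs):
# source word -> target word. In the paper, BWEs are learned from
# monolingual corpora; this toy dictionary stands in for that step.
bwe_lexicon = {"Hund": "dog", "Katze": "cat", "Vogel": "bird"}

# BERT masked LM used to propose in-context replacements on the source side.
fill_mask = pipeline("fill-mask", model="bert-base-german-cased")

def augment_pair(src, tgt, src_word, tgt_word, top_k=10):
    """Yield new (src, tgt) pairs: mask src_word, let BERT suggest
    alternatives, keep those the BWE lexicon can translate, and
    substitute the translation for tgt_word on the target side."""
    masked = src.replace(src_word, fill_mask.tokenizer.mask_token, 1)
    for cand in fill_mask(masked, top_k=top_k):
        new_src = cand["token_str"].strip()
        new_tgt = bwe_lexicon.get(new_src)
        if new_tgt is None or new_src == src_word:
            continue  # skip candidates without a BWE translation
        yield (masked.replace(fill_mask.tokenizer.mask_token, new_src, 1),
               tgt.replace(tgt_word, new_tgt, 1))

# Example: generate synthetic variants of one training pair.
for pair in augment_pair("Der Hund schläft im Garten.",
                         "The dog sleeps in the garden.",
                         "Hund", "dog"):
    print(pair)

In practice the replaced position would be chosen by rare-word or OOV statistics over the training corpus rather than given by hand, and the candidates would be filtered by LM score; the sketch only shows the substitute-and-project mechanism.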
Pages: 8