Augmenting Training Data for Low-Resource Neural Machine Translation via Bilingual Word Embeddings and BERT Language Modelling

Cited by: 0
Authors
Ramesh, Akshai [1 ]
Uhana, Haque Usuf [2 ]
Parthasarathy, Venkatesh Balavadhani [3 ]
Haque, Rejwanul [4 ]
Way, Andy [5 ]
Affiliations
[1] Iconic Translation Machines, Invent Building, DCU Campus, Dublin 9, Ireland
[2] Government College of Engineering & Textile Technology (GCETT), Department of Computer Science & Engineering, Serampore, India
[3] KantanMT, Invent Building, DCU Campus, Dublin 9, Ireland
[4] National College of Ireland, ADAPT Centre, School of Computing, IFSC, Mayor Square, Dublin 1, Ireland
[5] Dublin City University, ADAPT Centre, School of Computing, Dublin 9, Ireland
Source
2021 International Joint Conference on Neural Networks (IJCNN) | 2021
Funding
Science Foundation Ireland
Keywords
Machine translation; Neural machine translation; Transformer; Language modelling
DOI
10.1109/IJCNN52387.2021.9534211
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Neural machine translation (NMT) is often described as 'data hungry' because it typically requires large amounts of parallel data to build a good-quality machine translation (MT) system. However, most of the world's language pairs are low-resource or extremely low-resource, and the situation is even worse when translation in a specialised domain is considered. In this paper, we present a novel data augmentation method which makes use of bilingual word embeddings (BWEs) learned from monolingual corpora and Bidirectional Encoder Representations from Transformers (BERT) language models (LMs). We augment a parallel training corpus by introducing new words (i.e. out-of-vocabulary (OOV) items) and by increasing the presence of rare words on both sides of the original parallel training corpus. Our experiments on simulated low-resource German-English and French-English translation tasks show that the proposed data augmentation strategy can significantly improve state-of-the-art NMT systems and outperform the state-of-the-art data augmentation approach for low-resource NMT.
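For illustration, the following is a minimal Python sketch of the kind of augmentation loop the abstract describes: a BERT masked LM proposes in-context replacement words on the source side, and a BWE-induced lexicon supplies the corresponding target-side word. The lexicon bwe_lexicon, the helper augment_pair, and the choice of the bert-base-german-cased model are all illustrative assumptions, not the authors' implementation.

# A minimal sketch of BWE + BERT-LM data augmentation, assuming the
# Hugging Face 'transformers' library (pip install transformers).
from transformers import pipeline

# Hypothetical lexicon induced from bilingual word embeddings (BWEs):
# source word -> target word. In the paper, BWEs are learned from
# monolingual corpora; this toy dictionary stands in for that step.
bwe_lexicon = {"Hund": "dog", "Katze": "cat", "Vogel": "bird"}

# BERT masked LM used to propose in-context replacements on the source side.
fill_mask = pipeline("fill-mask", model="bert-base-german-cased")

def augment_pair(src, tgt, src_word, tgt_word, top_k=10):
    """Yield new (src, tgt) pairs: mask src_word, let BERT suggest
    alternatives, keep those the BWE lexicon can translate, and
    substitute the translation for tgt_word on the target side."""
    masked = src.replace(src_word, fill_mask.tokenizer.mask_token, 1)
    for cand in fill_mask(masked, top_k=top_k):
        new_src = cand["token_str"].strip()
        new_tgt = bwe_lexicon.get(new_src)
        if new_tgt is None or new_src == src_word:
            continue  # skip candidates without a BWE translation
        yield (masked.replace(fill_mask.tokenizer.mask_token, new_src, 1),
               tgt.replace(tgt_word, new_tgt, 1))

# Example: generate synthetic variants of one training pair.
for pair in augment_pair("Der Hund schläft im Garten.",
                         "The dog sleeps in the garden.",
                         "Hund", "dog"):
    print(pair)

In practice the replaced position would be chosen by rare-word or OOV statistics over the training corpus rather than given by hand, and the candidates would be filtered by LM score; the sketch only shows the substitute-and-project mechanism.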
Pages: 8