Pre-Training on Mixed Data for Low-Resource Neural Machine Translation

Cited by: 6
Authors
Zhang, Wenbo [1,2,3]
Li, Xiao [1,2,3]
Yang, Yating [1,2,3]
Dong, Rui [1,2,3]
Affiliations
[1] Chinese Acad Sci, Xinjiang Tech Inst Phys & Chem, Urumqi 830011, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Xinjiang Lab Minor Speech & Language Informat Pro, Urumqi 830011, Peoples R China
Keywords
neural machine translation; pre-training; low resource; word translation
DOI: 10.3390/info12030133
Chinese Library Classification: TP [Automation Technology, Computer Technology]
Discipline Classification Code: 0812
Abstract
The pre-training and fine-tuning paradigm has been shown to be effective for low-resource neural machine translation. In this paradigm, models pre-trained on monolingual data are used to initialize translation models, transferring knowledge from the monolingual data into them. In recent years, pre-training models have usually taken sentences with randomly masked words as input and been trained to predict these masked words from the unmasked ones. In this paper, we propose a new pre-training method that still predicts masked words, but randomly replaces some of the unmasked words in the input with their translations in another language. The translation words come from bilingual data, so the pre-training data contains both monolingual and bilingual data. We conduct experiments on a Uyghur-Chinese corpus to evaluate our method. The experimental results show that our method gives the pre-training model better generalization ability and helps the translation model achieve better performance. Through a word translation task, we also demonstrate that our method enables the embeddings of the translation model to acquire more alignment knowledge.
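
As a rough illustration of the mixed-data idea described in the abstract, the sketch below builds one pre-training example by masking some tokens (which become prediction targets) and swapping some of the remaining tokens for their translations from a bilingual lexicon. The function name, the probabilities, and the toy lexicon are illustrative assumptions, not details taken from the paper.

import random

MASK = "[MASK]"

def make_mixed_example(src_tokens, translations, mask_prob=0.15, replace_prob=0.15):
    """Build one pre-training input/target pair (illustrative sketch).

    src_tokens   : tokens of a monolingual sentence.
    translations : dict mapping a source token to its translation in the
                   other language (derived from bilingual data); tokens
                   without a known translation are left unchanged.
    mask_prob    : fraction of tokens to mask (these are the prediction targets).
    replace_prob : fraction of unmasked tokens to swap with their translation,
                   so one input mixes both languages.
    The probabilities and replacement policy are assumptions for illustration,
    not the values used in the paper.
    """
    input_tokens, targets = [], []
    for tok in src_tokens:
        r = random.random()
        if r < mask_prob:
            # Masked position: the model must predict the original token.
            input_tokens.append(MASK)
            targets.append(tok)
        elif r < mask_prob + replace_prob and tok in translations:
            # Unmasked position replaced by its translation word,
            # injecting bilingual signal into the pre-training data.
            input_tokens.append(translations[tok])
            targets.append(None)  # not predicted
        else:
            input_tokens.append(tok)
            targets.append(None)
    return input_tokens, targets

if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    lexicon = {"cat": "猫", "mat": "垫子"}  # toy bilingual word pairs (hypothetical)
    inp, tgt = make_mixed_example(sentence, lexicon)
    print(inp)
    print(tgt)
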
Pages: 10
Related Papers (50 records in total)
[31] Ramesh, Akshai; Uhana, Haque Usuf; Parthasarathy, Venkatesh Balavadhani; Haque, Rejwanul; Way, Andy. Augmenting Training Data for Low-Resource Neural Machine Translation via Bilingual Word Embeddings and BERT Language Modelling. 2021 International Joint Conference on Neural Networks (IJCNN), 2021.
[32] Wu, Nier; Hou, Hongxu; Li, Haoran; Chang, Xin; Jia, Xiaoning. Semantic Perception-Oriented Low-Resource Neural Machine Translation. Machine Translation, CCMT 2021, 2021, 1464: 51-62.
[33] Karakanta, Alina; Dehdari, Jon; van Genabith, Josef. Neural machine translation for low-resource languages without parallel corpora. Machine Translation, 2018, 32(1-2): 167-189.
[34] Tonja, Atnafu Lambebo; Kolesnikova, Olga; Gelbukh, Alexander; Sidorov, Grigori. Low-Resource Neural Machine Translation Improvement Using Source-Side Monolingual Data. Applied Sciences-Basel, 2023, 13(2).
[35] Luo, Gongxu; Yang, Yating; Yuan, Yang; Chen, Zhanheng; Ainiwaer, Aizimaiti. Hierarchical Transfer Learning Architecture for Low-Resource Neural Machine Translation. IEEE Access, 2019, 7: 154157-154166.
[36] Signoroni, Edoardo; Rychly, Pavel. Better Low-Resource Machine Translation with Smaller Vocabularies. Text, Speech, and Dialogue, TSD 2024, Pt I, 2024, 15048: 184-195.
[37] San, Mya Ei; Usanavasin, Sasiporn; Thu, Ye Kyaw; Okumura, Manabu. A Study for Enhancing Low-resource Thai-Myanmar-English Neural Machine Translation. ACM Transactions on Asian and Low-Resource Language Information Processing, 2024, 23(4).
[38] Knowles, Rebecca; Littell, Patrick. Translation Memories as Baselines for Low-Resource Machine Translation. LREC 2022: Thirteenth International Conference on Language Resources and Evaluation, 2022: 6759-6767.
[39] Zhang, Wenbo; Li, Xiao; Yang, Yating; Dong, Rui; Luo, Gongxu. Keeping Models Consistent between Pretraining and Translation for Low-Resource Neural Machine Translation. Future Internet, 2020, 12(12): 1-13.
[40] Wen, Zhihao; Fang, Yuan. Augmenting Low-Resource Text Classification with Graph-Grounded Pre-training and Prompting. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, 2023: 506-516.