Neural machine translation of low-resource languages using SMT phrase pair injection

Cited by: 20
Authors
Sen, Sukanta [1 ]
Hasanuzzaman, Mohammed [2 ]
Ekbal, Asif [1 ]
Bhattacharyya, Pushpak [1 ]
Way, Andy [2 ]
Affiliations
[1] Indian Inst Technol Patna, Patna, Bihar, India
[2] Dublin City Univ, ADAPT Ctr, Dublin, Ireland
Keywords
Machine translation; Translation technology;
DOI
10.1017/S1351324920000303
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality, large-scale parallel corpus, and such a corpus is not always available because building one costs time, money, and professional expertise. Hence, many existing large-scale parallel corpora are limited to specific languages and domains. In this paper, we propose an effective approach to improving an NMT system in a low-resource scenario without using any additional data. Our approach augments the original training data with parallel phrases that a statistical machine translation (SMT) system extracts from that same training data. Our proposed models are based on the gated recurrent unit (GRU) and Transformer networks. We use Hindi-English and Hindi-Bengali datasets for the Health, Tourism, and Judicial (Hindi-English only) domains and train NMT models for 10 translation directions, each using only 5-23k parallel sentences. Experiments show improvements in the range of 1.38-15.36 BLEU (BiLingual Evaluation Understudy) points over the baseline systems, and that Transformer models perform better than GRU models in low-resource scenarios. In addition, we find that our proposed method outperforms SMT, which is known to work better than neural models in low-resource scenarios, for some translation directions. To further show the effectiveness of our proposed model, we apply our approach to another interesting NMT task, old-to-modern English translation, using a tiny parallel corpus of only 2.7k sentences. For this task, we use publicly available old-modern English text that is approximately 1,000 years old. Evaluation on this task shows a significant improvement over the baseline NMT system.
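The central step described in the abstract, injecting phrase pairs extracted by an SMT system back into the original parallel data, can be sketched as follows. This is a minimal illustration, assuming a Moses-style phrase table (phrase-table.gz) has already been extracted from the original training corpus; the file names, score-column layout, and probability threshold below are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of SMT phrase pair injection (assumptions: Moses-style
# phrase table, standard 4-score layout, illustrative threshold).
import gzip

PHRASE_TABLE = "phrase-table.gz"  # hypothetical path to the extracted table
MIN_PROB = 0.5                    # illustrative filter on phi(e|f)
MAX_PHRASE_LEN = 7                # Moses' default maximum phrase length

def extract_phrase_pairs(path, min_prob=MIN_PROB):
    """Yield (source, target) phrase pairs whose direct phrase translation
    probability phi(e|f) exceeds min_prob."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            fields = line.split(" ||| ")
            if len(fields) < 3:
                continue  # skip malformed lines
            src, tgt = fields[0].strip(), fields[1].strip()
            scores = fields[2].split()
            # In the standard Moses layout the scores are
            # phi(f|e), lex(f|e), phi(e|f), lex(e|f); adjust the index
            # if your table uses a different layout.
            if len(scores) >= 3 and float(scores[2]) >= min_prob:
                if len(src.split()) <= MAX_PHRASE_LEN:
                    yield src, tgt

# Append the filtered phrase pairs to the parallel training files, so the
# NMT model trains on the original sentences plus the injected phrases.
with open("train.src", "a", encoding="utf-8") as f_src, \
     open("train.tgt", "a", encoding="utf-8") as f_tgt:
    for src, tgt in extract_phrase_pairs(PHRASE_TABLE):
        f_src.write(src + "\n")
        f_tgt.write(tgt + "\n")
```

The augmented train.src/train.tgt files are then used to train the GRU or Transformer model exactly as before; the injected phrase pairs simply act as additional short sentence pairs.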
Pages: 271-292
Page count: 22