Neural machine translation of low-resource languages using SMT phrase pair injection

Cited by: 20
Authors
Sen, Sukanta [1 ]
Hasanuzzaman, Mohammed [2 ]
Ekbal, Asif [1 ]
Bhattacharyya, Pushpak [1 ]
Way, Andy [2 ]
Affiliations
[1] Indian Inst Technol Patna, Patna, Bihar, India
[2] Dublin City Univ, ADAPT Ctr, Dublin, Ireland
Keywords
Machine translation; Translation technology;
DOI
10.1017/S1351324920000303
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality, large-scale parallel corpus, and such a corpus is not always available because building one costs time, money, and professional expertise. Hence, many existing large-scale parallel corpora are limited to specific languages and domains. In this paper, we propose an effective approach to improving an NMT system in a low-resource scenario without using any additional data. Our approach augments the original training data with parallel phrases that a statistical machine translation (SMT) system extracts from that same training data. Our proposed models are based on the gated recurrent unit (GRU) and Transformer networks. We use Hindi-English and Hindi-Bengali datasets for the Health, Tourism, and Judicial (Hindi-English only) domains and train NMT models for 10 translation directions, each using only 5-23k parallel sentences. Experiments show improvements in the range of 1.38-15.36 BLEU (BiLingual Evaluation Understudy) points over the baseline systems, and that Transformer models perform better than GRU models in low-resource scenarios. In addition, we find that our proposed method outperforms SMT, which is known to work better than neural models in low-resource scenarios, for some translation directions. To further show the effectiveness of our proposed model, we apply our approach to another interesting NMT task, old-to-modern English translation, using a tiny parallel corpus of only 2.7k sentences. For this task, we use publicly available old-modern English text that is approximately 1,000 years old. Evaluation on this task shows a significant improvement over the baseline NMT system.
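The central step described in the abstract, injecting phrase pairs extracted by an SMT system back into the original parallel data, can be sketched as follows. This is a minimal illustration, assuming a Moses-style phrase table (phrase-table.gz) has already been extracted from the original training corpus; the file names, score-column layout, and probability threshold below are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of SMT phrase pair injection (assumptions: Moses-style
# phrase table, standard 4-score layout, illustrative threshold).
import gzip

PHRASE_TABLE = "phrase-table.gz"  # hypothetical path to the extracted table
MIN_PROB = 0.5                    # illustrative filter on phi(e|f)
MAX_PHRASE_LEN = 7                # Moses' default maximum phrase length

def extract_phrase_pairs(path, min_prob=MIN_PROB):
    """Yield (source, target) phrase pairs whose direct phrase translation
    probability phi(e|f) exceeds min_prob."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            fields = line.split(" ||| ")
            if len(fields) < 3:
                continue  # skip malformed lines
            src, tgt = fields[0].strip(), fields[1].strip()
            scores = fields[2].split()
            # In the standard Moses layout the scores are
            # phi(f|e), lex(f|e), phi(e|f), lex(e|f); adjust the index
            # if your table uses a different layout.
            if len(scores) >= 3 and float(scores[2]) >= min_prob:
                if len(src.split()) <= MAX_PHRASE_LEN:
                    yield src, tgt

# Append the filtered phrase pairs to the parallel training files, so the
# NMT model trains on the original sentences plus the injected phrases.
with open("train.src", "a", encoding="utf-8") as f_src, \
     open("train.tgt", "a", encoding="utf-8") as f_tgt:
    for src, tgt in extract_phrase_pairs(PHRASE_TABLE):
        f_src.write(src + "\n")
        f_tgt.write(tgt + "\n")
```

The augmented train.src/train.tgt files are then used to train the GRU or Transformer model exactly as before; the injected phrase pairs simply act as additional short sentence pairs.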
Pages: 271-292
Page count: 22