Augmenting training data with syntactic phrasal-segments in low-resource neural machine translation

Cited: 1
|
Authors
Gupta, Kamal Kumar [1 ]
Sen, Sukanta [1 ]
Haque, Rejwanul [2 ]
Ekbal, Asif [1 ]
Bhattacharyya, Pushpak [1 ]
Way, Andy [2 ]
Affiliations
[1] Indian Inst Technol Patna, Dept Comp Sci & Engn, Patna, Bihar, India
[2] Dublin City Univ, ADAPT Ctr, Sch Comp, Dublin, Ireland
Keywords
Neural machine translation; Low-resource neural machine translation; Data augmentation; Syntactic phrase augmentation;
DOI
10.1007/s10590-021-09290-0
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Neural machine translation (NMT) has emerged as a preferred alternative to the previously mainstream statistical machine translation (SMT) approaches, largely due to its ability to produce better translations. NMT training is often characterized as data hungry, since a large amount of training data, on the order of a few million parallel sentences, is generally required. This is a bottleneck for under-resourced languages that lack such resources. Researchers in machine translation (MT) have tried to address this data sparsity problem by augmenting the training data using different strategies. In this paper, we propose a generalized, linguistically motivated data augmentation approach for NMT with low-resource translation in mind. The proposed method generates source-target phrasal segments from an authentic parallel corpus, where the target-side counterparts are linguistic phrases extracted from the syntactic parse trees of the target-side sentences. We augment the authentic training corpus with the parser-generated phrasal segments and investigate the efficacy of the proposed strategy in low-resource scenarios. To this end, we carried out experiments on resource-poor language pairs, viz. Hindi-to-English, Malayalam-to-English, and Telugu-to-English, considering three state-of-the-art NMT paradigms: the attention-based recurrent neural network (Bahdanau et al. 2015), the Google Transformer (Vaswani et al. 2017), and the convolutional sequence-to-sequence (Gehring et al. 2017) neural network models. The MT systems built on training data prepared with our data augmentation strategy surpassed the state-of-the-art NMT systems by large margins in all three translation tasks. Further, we tested our approach together with back-translation (Sennrich et al. 2016a), and found the two to be complementary.
This joint approach has turned out to be the best-performing one in our low-resource experimental settings.
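The core target-side step described in the abstract — extracting linguistic phrases from the constituency parse trees of target sentences to use as augmentation segments — can be illustrated with a minimal sketch. This is not the authors' implementation; the bracketed-tree parser, the phrase labels (NP/VP/PP), and the example sentence are all illustrative assumptions:

```python
def parse_tree(s):
    """Parse a bracketed constituency string into (label, children) tuples;
    leaves are plain word strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                       # consume "("
            label = tokens[pos]; pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                       # consume ")"
            return (label, children)
        tok = tokens[pos]; pos += 1        # a leaf (word)
        return tok

    return read()

def leaves(node):
    """Return the word yield of a subtree."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]

def extract_phrases(node, labels=("NP", "VP", "PP")):
    """Collect the word sequences of all subtrees whose label is in `labels`;
    these sequences are candidate phrasal segments for augmentation."""
    if isinstance(node, str):
        return []
    found = []
    if node[0] in labels:
        found.append(" ".join(leaves(node)))
    for child in node[1]:
        found.extend(extract_phrases(child, labels))
    return found

# Hypothetical parser output for "the cat sat on the mat"
parse = "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
print(extract_phrases(parse_tree(parse)))
# → ['the cat', 'sat on the mat', 'on the mat', 'the mat']
```

Each extracted target-side phrase would then be paired with its source-side counterpart and appended to the authentic parallel corpus as an additional (shorter) training pair.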
Pages: 25
Related Papers
50 records total
  • [31] Simulated Multiple Reference Training Improves Low-Resource Machine Translation
    Khayrallah, Huda
    Thompson, Brian
    Post, Matt
    Koehn, Philipp
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 82 - 89
  • [32] Semantic Perception-Oriented Low-Resource Neural Machine Translation
    Wu, Nier
    Hou, Hongxu
    Li, Haoran
    Chang, Xin
    Jia, Xiaoning
    MACHINE TRANSLATION, CCMT 2021, 2021, 1464 : 51 - 62
  • [33] A Content Word Augmentation Method for Low-Resource Neural Machine Translation
    Li, Fuxue
    Zhao, Zhongchao
    Chi, Chuncheng
    Yan, Hong
    Zhang, Zhen
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, ICIC 2023, PT IV, 2023, 14089 : 720 - 731
  • [34] Understanding and Improving Low-Resource Neural Machine Translation with Shallow Features
    Sun, Yanming
    Liu, Xuebo
    Wong, Derek F.
    Lin, Yuchu
    Li, Bei
    Zhan, Runzhe
    Chao, Lidia S.
    Zhang, Min
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024, 2025, 15361 : 227 - 239
  • [35] Incremental Domain Adaptation for Neural Machine Translation in Low-Resource Settings
    Kalimuthu, Marimuthu
    Barz, Michael
    Sonntag, Daniel
    FOURTH ARABIC NATURAL LANGUAGE PROCESSING WORKSHOP (WANLP 2019), 2019, : 1 - 10
  • [36] Benchmarking Neural and Statistical Machine Translation on Low-Resource African Languages
    Duh, Kevin
    McNamee, Paul
    Post, Matt
    Thompson, Brian
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 2667 - 2675
  • [37] An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages
    Mueller, Aaron
    Nicolai, Garrett
    McCarthy, Arya D.
    Lewis, Dylan
    Wu, Winston
    Yarowsky, David
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 3710 - 3718
  • [38] Towards a Low-Resource Neural Machine Translation for Indigenous Languages in Canada
    Ngoc Tan Le
    Sadat, Fatiha
    TRAITEMENT AUTOMATIQUE DES LANGUES, 2021, 62 (03): : 39 - 63
  • [39] Regressing Word and Sentence Embeddings for Low-Resource Neural Machine Translation
    Unanue, I. J.
    Borzeshi, E. Z.
    Piccardi, M.
    IEEE Transactions on Artificial Intelligence, 2023, 4 (03): : 450 - 463
  • [40] Neural machine translation for low-resource languages without parallel corpora
    Karakanta, Alina
    Dehdari, Jon
    van Genabith, Josef
    MACHINE TRANSLATION, 2018, 32 (1-2) : 167 - 189