Augmenting training data with syntactic phrasal-segments in low-resource neural machine translation

Cited: 1
Authors
Gupta, Kamal Kumar [1 ]
Sen, Sukanta [1 ]
Haque, Rejwanul [2 ]
Ekbal, Asif [1 ]
Bhattacharyya, Pushpak [1 ]
Way, Andy [2 ]
Affiliations
[1] Indian Inst Technol Patna, Dept Comp Sci & Engn, Patna, Bihar, India
[2] Dublin City Univ, ADAPT Ctr, Sch Comp, Dublin, Ireland
Keywords
Neural machine translation; Low-resource neural machine translation; Data augmentation; Syntactic phrase augmentation
DOI
10.1007/s10590-021-09290-0
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Neural machine translation (NMT) has emerged as the preferred alternative to the previously mainstream statistical machine translation (SMT) approaches, largely due to its ability to produce better translations. NMT training is often characterized as data hungry, since a large amount of training data, on the order of a few million parallel sentences, is generally required. This is a bottleneck for under-resourced languages that lack such corpora. Researchers in machine translation (MT) have tried to alleviate this data sparsity by augmenting the training data with a variety of strategies. In this paper, we propose a generalized, linguistically motivated data augmentation approach for NMT aimed at low-resource translation. The proposed method generates source-target phrasal segments from an authentic parallel corpus; the target-side counterparts are linguistic phrases extracted from the syntactic parse trees of the target-side sentences. We augment the authentic training corpus with these parser-generated phrasal segments and investigate the efficacy of the proposed strategy in low-resource scenarios. To this end, we carried out experiments on resource-poor language pairs, viz. Hindi-to-English, Malayalam-to-English, and Telugu-to-English, considering three state-of-the-art NMT paradigms: the attention-based recurrent neural network (Bahdanau et al. 2015), the Transformer (Vaswani et al. 2017), and the convolutional sequence-to-sequence model (Gehring et al. 2017). The MT systems built on training data prepared with our augmentation strategy surpassed the corresponding state-of-the-art NMT baselines by large margins in all three translation tasks. Further, we combined our approach with back-translation (Sennrich et al. 2016a) and found the two to be complementary; this joint approach turned out to be the best-performing one in our low-resource experimental settings.
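To make the extraction step concrete, the sketch below illustrates how such phrasal segments might be obtained. It is a minimal sketch, not the authors' implementation: it assumes target parses are available in Penn-Treebank bracket notation (from any constituency parser), and that source-side counterparts are recovered from word alignments with the usual phrase-pair consistency check, an assumption the abstract does not spell out. The phrase labels, length bounds, toy sentence pair, and alignment links are all hypothetical.

```python
# Sketch of syntactic phrasal-segment extraction under the assumptions
# stated above; every constant and the toy example are hypothetical.
from nltk import Tree

PHRASE_LABELS = {"NP", "VP", "PP"}  # assumed constituent types to extract
MIN_LEN, MAX_LEN = 2, 6             # assumed segment-length bounds

def spans(tree, start=0):
    """Yield (label, start, end) token spans for every constituent."""
    end = start
    for child in tree:
        if isinstance(child, Tree):
            yield from spans(child, end)
            end += len(child.leaves())
        else:
            end += 1  # a leaf token
    yield (tree.label(), start, end)

def project(align, t_start, t_end):
    """Map a target span to a source span via (src, tgt) alignment links,
    rejecting spans that violate phrase-extraction consistency."""
    src_idx = {i for i, j in align if t_start <= j < t_end}
    if not src_idx:
        return None
    s_start, s_end = min(src_idx), max(src_idx) + 1
    # Reject if a source word inside the span aligns outside the target span.
    if any(s_start <= i < s_end and not (t_start <= j < t_end)
           for i, j in align):
        return None
    return s_start, s_end

if __name__ == "__main__":
    # Toy romanized Hindi-English pair with hand-made alignment links.
    src = "vah bazaar gaya".split()
    tgt = "he went to the market".split()
    parse = ("(S (NP (PRP he)) (VP (VBD went)"
             " (PP (TO to) (NP (DT the) (NN market)))))")
    align = {(0, 0), (2, 1), (1, 4)}  # vah-he, gaya-went, bazaar-market

    for label, s, e in spans(Tree.fromstring(parse)):
        if label in PHRASE_LABELS and MIN_LEN <= e - s <= MAX_LEN:
            proj = project(align, s, e)
            if proj:
                print(" ".join(src[proj[0]:proj[1]]),
                      "|||", " ".join(tgt[s:e]))
```

In a full pipeline, the phrase pairs printed here would simply be appended to the authentic parallel corpus before NMT training, which is the augmentation step the abstract describes.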
Pages: 25
Related Papers
50 records in total
  • [1] Augmenting training data with syntactic phrasal-segments in low-resource neural machine translation
    Gupta, Kamal Kumar
    Sen, Sukanta
    Haque, Rejwanul
    Ekbal, Asif
    Bhattacharyya, Pushpak
    Way, Andy
    MACHINE TRANSLATION, 2021, 35 (04) : 661 - 685
  • [2] Augmenting Training Data for Low-Resource Neural Machine Translation via Bilingual Word Embeddings and BERT Language Modelling
    Ramesh, Akshai
    Uhana, Haque Usuf
    Parthasarathy, Venkatesh Balavadhani
    Haque, Rejwanul
    Way, Andy
2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [3] Pre-Training on Mixed Data for Low-Resource Neural Machine Translation
    Zhang, Wenbo
    Li, Xiao
    Yang, Yating
    Dong, Rui
    INFORMATION, 2021, 12 (03)
  • [4] Data Augmentation for Low-Resource Neural Machine Translation
    Fadaee, Marzieh
    Bisazza, Arianna
    Monz, Christof
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 2, 2017, : 567 - 573
  • [5] Handling Syntactic Divergence in Low-resource Machine Translation
    Zhou, Chunting
    Ma, Xuezhe
    Hu, Junjie
    Neubig, Graham
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 1388 - 1394
  • [6] A Survey on Low-Resource Neural Machine Translation
    Wang, Rui
    Tan, Xu
    Luo, Renqian
    Qin, Tao
    Liu, Tie-Yan
    PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 4636 - 4643
  • [7] A Survey on Low-resource Neural Machine Translation
    Li H.-Z.
    Feng C.
    Huang H.-Y.
SCIENCE PRESS, 47: 1217 - 1231
  • [8] Transformers for Low-resource Neural Machine Translation
    Gezmu, Andargachew Mekonnen
    Nuernberger, Andreas
    ICAART: PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON AGENTS AND ARTIFICIAL INTELLIGENCE - VOL 1, 2022, : 459 - 466
  • [9] Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation
    Pang, Jianhui
    Yang, Baosong
    Wong, Derek Fai
    Wan, Yu
    Liu, Dayiheng
    Chao, Lidia Sam
    Xie, Jun
COMPUTATIONAL LINGUISTICS, 2024, 50 (01) : 25 - 47
  • [10] A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation
    Li, Yu
    Li, Xiao
    Yang, Yating
    Dong, Rui
    INFORMATION, 2020, 11 (05)