Augmenting training data with syntactic phrasal-segments in low-resource neural machine translation

Cited by: 1
Authors
Gupta, Kamal Kumar [1 ]
Sen, Sukanta [1 ]
Haque, Rejwanul [2 ]
Ekbal, Asif [1 ]
Bhattacharyya, Pushpak [1 ]
Way, Andy [2 ]
Affiliations
[1] Indian Inst Technol Patna, Dept Comp Sci & Engn, Patna, Bihar, India
[2] Dublin City Univ, ADAPT Ctr, Sch Comp, Dublin, Ireland
Keywords
Neural machine translation; Low-resource neural machine translation; Data augmentation; Syntactic phrase augmentation
DOI
10.1007/s10590-021-09290-0
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Neural machine translation (NMT) has emerged as the preferred alternative to the previously mainstream statistical machine translation (SMT) approaches, largely due to its ability to produce better translations. NMT training is often characterized as data hungry, since a large amount of training data, on the order of a few million parallel sentences, is generally required. This is a bottleneck for under-resourced languages that lack such resources. Researchers in machine translation (MT) have tried to address this data sparsity by augmenting the training data using different strategies. In this paper, we propose a generalized, linguistically motivated data augmentation approach for NMT with low-resource translation in mind. The proposed method generates source-target phrasal segments from an authentic parallel corpus, whose target counterparts are linguistic phrases extracted from the syntactic parse trees of the target-side sentences. We augment the authentic training corpus with the parser-generated phrasal segments and investigate the efficacy of the proposed strategy in low-resource scenarios. To this end, we carried out experiments on resource-poor language pairs, viz. Hindi-to-English, Malayalam-to-English, and Telugu-to-English, considering three state-of-the-art NMT paradigms: the attention-based recurrent neural network (Bahdanau et al., 2015), the Transformer (Vaswani et al., 2017), and the convolutional sequence-to-sequence model (Gehring et al., 2017). The MT systems built on training data prepared with our augmentation strategy surpassed the state-of-the-art NMT baselines by large margins in all three translation tasks. Furthermore, we tested our approach in combination with back-translation (Sennrich et al., 2016a) and found the two to be complementary. This joint approach turned out to be the best-performing one in our low-resource experimental settings.
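The central step the abstract describes, harvesting linguistic phrases from target-side constituency parses before pairing them with their source-side counterparts, can be illustrated with a short sketch. The snippet below is a minimal illustration using NLTK's Tree API, assuming parses are available as bracketed strings from an off-the-shelf constituency parser; the phrase labels, length bounds, and function name are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of target-side phrase extraction, assuming bracketed
# constituency parses (e.g. "(S (NP ...) (VP ...))") from an external parser.
# Phrase labels and length bounds here are illustrative assumptions.
from nltk.tree import Tree

PHRASE_LABELS = {"NP", "VP", "PP"}  # assumed syntactic phrase types to harvest

def extract_target_phrases(parse_str, min_len=2, max_len=7):
    """Return the yields of phrasal subtrees within the given length bounds."""
    tree = Tree.fromstring(parse_str)
    phrases = []
    for subtree in tree.subtrees(lambda t: t.label() in PHRASE_LABELS):
        words = subtree.leaves()
        if min_len <= len(words) <= max_len:
            phrases.append(" ".join(words))
    return phrases

if __name__ == "__main__":
    parse = "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
    for phrase in extract_target_phrases(parse):
        print(phrase)  # "the cat", "sat on the mat", "on the mat", "the mat"
```

Each extracted target phrase would then be paired with its aligned source-side segment (for instance, via word alignments from a tool such as GIZA++ or fast_align) and the resulting phrase pairs appended to the authentic parallel corpus as additional training instances.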
Pages: 25