Augmenting training data with syntactic phrasal-segments in low-resource neural machine translation

Cited by: 1
Authors
Gupta, Kamal Kumar [1 ]
Sen, Sukanta [1 ]
Haque, Rejwanul [2 ]
Ekbal, Asif [1 ]
Bhattacharyya, Pushpak [1 ]
Way, Andy [2 ]
Affiliations
[1] Indian Inst Technol Patna, Dept Comp Sci & Engn, Patna, Bihar, India
[2] Dublin City Univ, ADAPT Ctr, Sch Comp, Dublin, Ireland
Keywords
Neural machine translation; Low-resource neural machine translation; Data augmentation; Syntactic phrase augmentation;
DOI
10.1007/s10590-021-09290-0
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Neural machine translation (NMT) has emerged as the preferred alternative to the previously mainstream statistical machine translation (SMT) approaches, largely due to its ability to produce better translations. NMT training is often characterized as data-hungry, since large amounts of training data, on the order of a few million parallel sentences, are generally required. This is a bottleneck for under-resourced languages that lack such resources. Researchers in machine translation (MT) have tried to solve the problem of data sparsity by augmenting the training data using different strategies. In this paper, we propose a generalized, linguistically motivated data augmentation approach for NMT with low-resource translation in mind. The proposed method generates source-target phrasal segments from an authentic parallel corpus, whose target counterparts are linguistic phrases extracted from the syntactic parse trees of the target-side sentences. We augment the authentic training corpus with the parser-generated phrasal segments, and investigate the efficacy of our proposed strategy in low-resource scenarios. To this end, we carried out experiments with resource-poor language pairs, viz. Hindi-to-English, Malayalam-to-English, and Telugu-to-English, considering three state-of-the-art NMT paradigms: the attention-based recurrent neural network (Bahdanau et al. 2015), the Transformer (Vaswani et al. 2017), and the convolutional sequence-to-sequence model (Gehring et al. 2017). The MT systems built on training data prepared with our augmentation strategy surpassed the corresponding state-of-the-art NMT systems by large margins in all three translation tasks. Further, we tested our approach together with back-translation (Sennrich et al. 2016a) and found the two to be complementary. This joint approach turned out to be the best-performing one in our low-resource experimental settings.
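The core of the method, as described above, pairs target-side syntactic phrases with their source-side counterparts and appends these segment pairs to the training corpus. A minimal sketch of that idea follows, assuming constituency parses of the target sentences (as bracketed strings readable by nltk.Tree) and precomputed word alignments from an external aligner; the phrase label set, the alignment format, and all function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of phrasal-segment augmentation, NOT the paper's code.
# Assumptions: target parses are bracketed strings; word alignments are
# precomputed (source_index, target_index) pairs from an external aligner.

from nltk import Tree

PHRASE_LABELS = {"NP", "VP", "PP"}  # assumed syntactic phrase types to extract


def phrase_spans(tree, offset=0):
    """Return (end, spans): the absolute end position of this subtree's
    leaves and the (start, end) token spans of all phrase-labelled subtrees."""
    spans = []
    pos = offset
    for child in tree:
        if isinstance(child, Tree):
            pos, child_spans = phrase_spans(child, pos)
            spans.extend(child_spans)
        else:
            pos += 1  # a leaf token
    if tree.label() in PHRASE_LABELS:
        spans.append((offset, pos))
    return pos, spans


def project_to_source(tgt_span, alignment):
    """Project a target token span onto the source side via word alignments;
    returns None if no source word aligns into the span."""
    src = [s for s, t in alignment if tgt_span[0] <= t < tgt_span[1]]
    return (min(src), max(src) + 1) if src else None


def phrasal_segments(src_tokens, tgt_tokens, tgt_parse, alignment):
    """Yield (source_segment, target_segment) pairs to append to the corpus."""
    _, spans = phrase_spans(Tree.fromstring(tgt_parse))
    for start, end in spans:
        src_span = project_to_source((start, end), alignment)
        if src_span is not None:
            yield src_tokens[src_span[0]:src_span[1]], tgt_tokens[start:end]


# Toy example: Hindi (romanized) -> English, with hypothetical alignments.
src = "mera naam raam hai".split()
tgt = "my name is ram".split()
parse = "(S (NP (PRP$ my) (NN name)) (VP (VBZ is) (NP (NNP ram))))"
align = {(0, 0), (1, 1), (3, 2), (2, 3)}
for s_seg, t_seg in phrasal_segments(src, tgt, parse, align):
    print(" ".join(s_seg), "|||", " ".join(t_seg))
# -> mera naam ||| my name / raam ||| ram / raam hai ||| is ram
```

Note that this naive span projection omits the alignment-consistency checks that standard phrase extraction in SMT applies; a faithful reimplementation would additionally filter out segment pairs whose alignment links cross the span boundaries.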
Pages: 25