Data augmentation for low-resource languages NMT guided by constrained sampling

Cited by: 12
Authors
Maimaiti, Mieradilijiang [1 ]
Liu, Yang [1 ,2 ]
Luan, Huanbo [1 ]
Sun, Maosong [1 ]
Affiliations
[1] Tsinghua Univ, State Key Lab Intelligent Technol & Syst, Inst Artificial Intelligence, Dept Comp Sci & Technol,Beijing Natl Res Ctr Info, Beijing 100084, Peoples R China
[2] Beijing Acad Artificial Intelligence, Beijing Adv Innovat Ctr Language Resources, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; National Key R&D Program of China;
Keywords
artificial intelligence; constrained sampling; data augmentation; low-resource languages; natural language processing; neural machine translation;
DOI
10.1002/int.22616
Chinese Library Classification (CLC)
TP18 [theory of artificial intelligence];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
Data augmentation (DA) is a ubiquitous approach for many text generation tasks. In the machine translation paradigm, especially in low-resource language scenarios, many DA methods have appeared. The most common build a pseudo-corpus by randomly sampling, omitting, or replacing words in the text. However, previous approaches can hardly guarantee the quality of the augmented data. In this study, we augment the corpus with a constrained sampling method and additionally build an evaluation framework to select higher-quality data after augmentation: a discriminator submodel mitigates syntactic and semantic errors to some extent. Experimental results show that our augmentation method consistently outperforms all previous state-of-the-art methods on both small- and large-scale corpora, across eight language pairs from four corpora, by 2.38-4.18 BLEU (bilingual evaluation understudy) points.
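The random omission/replacement baseline that the abstract contrasts against can be sketched as follows. This is a generic illustration only: the function name and probabilities are invented, and it shows the naive pseudo-corpus construction, not the paper's constrained-sampling method or its discriminator-based filtering.

```python
import random


def augment_sentence(tokens, vocab, p=0.1, rng=None):
    """Toy word-level augmentation: randomly omit or replace tokens.

    With probability p a token is dropped; with probability p it is
    replaced by a random vocabulary word; otherwise it is kept.
    (Hypothetical sketch of random-sampling DA, not the paper's method.)
    """
    rng = rng or random.Random(0)  # seeded for reproducibility
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p:                  # omit this token
            continue
        if r < 2 * p:              # replace with a random vocabulary word
            out.append(rng.choice(vocab))
        else:                      # keep the token unchanged
            out.append(tok)
    return out


# Example: augment one source-side sentence of a parallel corpus.
source = "the quick brown fox jumps".split()
print(augment_sentence(source, vocab=["cat", "dog", "bird"], p=0.3))
```

Because such edits ignore syntax and semantics, the augmented sentences can be ungrammatical or meaning-breaking, which is exactly the quality problem the paper's discriminator submodel is meant to mitigate.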
Pages: 30-51
Page count: 22
References
48 items
[1] Alvarez, 2018, arXiv:1809.03664
[2] Artetxe Mikel, 2017, 6th International Conference on Learning Representations (ICLR)
[3] Bahdanau D., 2016, arXiv:1409.0473, DOI 10.48550/arXiv.1409.0473
[4] Chen Yun, Liu Yang, Cheng Yong, Li Victor O. K., A Teacher-Student Framework for Zero-Resource Neural Machine Translation, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vol 1, 2017, pp. 1925-1935
[5] Cheng Y., 2017, Joint Training for Neural Machine Translation, p. 41
[6] Cho K., 2014, arXiv:1406.1078, DOI 10.3115/v1/D14-1179
[7] Chu Chenhui, 2017, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics
[8] Cubuk Ekin D., Zoph Barret, Mane Dandelion, Vasudevan Vijay, Le Quoc V., AutoAugment: Learning Augmentation Strategies from Data, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019, pp. 113-123
[9] Currey A., 2017, Proceedings of the 2nd Conference on Machine Translation, p. 148, DOI 10.18653/v1/W17-4715
[10] Fadaee Marzieh, Bisazza Arianna, Monz Christof, Data Augmentation for Low-Resource Neural Machine Translation, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vol 2, 2017, pp. 567-573