Robust Data Augmentation for Neural Machine Translation through EVALNET

Cited by: 4
Authors
Park, Yo-Han [1 ]
Choi, Yong-Seok [1 ]
Yun, Seung [2 ]
Kim, Sang-Hun [2 ]
Lee, Kong-Joo [1 ]
Affiliations
[1] ChungNam Natl Univ, Dept Radio & Informat Commun Engn, 99 Daehak Ro, Daejeon 34134, South Korea
[2] Elect & Telecommun Res Inst ETRI, Artificial Intelligence Res Lab, 218 Gajeong Ro, Daejeon 34129, South Korea
Keywords
neural machine translation; data augmentation; data reweighting; EvalNet
DOI
10.3390/math11010123
CLC Number
O1 [Mathematics]
Discipline Codes
0701; 070101
Abstract
Since building Neural Machine Translation (NMT) systems requires a large parallel corpus, various data augmentation techniques have been adopted, especially for low-resource languages. To achieve the best performance from data augmentation, an NMT system should be able to evaluate the quality of the augmented data. Several studies have addressed data weighting techniques for assessing data quality. The basic idea of data weighting in previous studies is to use the loss value that a system computes when learning from training data; the weight derived from this loss value, through simple heuristic rules or neural models, adjusts the loss used in the next step of the learning process. In this study, we propose EvalNet, a data evaluation network that assesses parallel data for NMT. EvalNet exploits a loss value, a cross-attention map, and the semantic similarity between parallel sentences as its features. The cross-attention map is an encoded representation of the cross-attention layers of the Transformer, the base architecture of the NMT system. The semantic similarity is the cosine distance between the semantic embeddings of a source sentence and a target sentence. Owing to the parallelism of the data, the combination of the cross-attention map and the semantic similarity proves to be an effective set of features for data quality evaluation, alongside the loss value. EvalNet is the first NMT data-evaluator network to introduce the cross-attention map and semantic similarity as features. Through various experiments, we conclude that EvalNet is simple yet beneficial for robust training of an NMT system and outperforms previous approaches as a data evaluator.
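To illustrate how the three features described in the abstract could be assembled for a single sentence pair, the following is a minimal sketch using NumPy. The function names and the mean-pooling of the cross-attention map are assumptions for illustration, not the paper's actual implementation; the paper's cross-attention map comes from the Transformer's cross-attention layers and its loss from NMT training.

```python
import numpy as np

def cosine_similarity(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Cosine similarity between source and target sentence embeddings."""
    return float(np.dot(src_emb, tgt_emb) /
                 (np.linalg.norm(src_emb) * np.linalg.norm(tgt_emb)))

def evalnet_features(loss: float,
                     cross_attention: np.ndarray,  # shape: (tgt_len, src_len)
                     src_emb: np.ndarray,
                     tgt_emb: np.ndarray) -> np.ndarray:
    """Assemble a feature vector for one sentence pair:
    the per-example loss, a pooled summary of the cross-attention map,
    and the semantic similarity of the two sentence embeddings.
    (Pooling over target positions is an illustrative choice.)"""
    attn_summary = cross_attention.mean(axis=0)
    sim = cosine_similarity(src_emb, tgt_emb)
    return np.concatenate([[loss], attn_summary, [sim]])
```

In this sketch, a downstream evaluator network would map such feature vectors to per-example weights that rescale the training loss, which is the general data-reweighting scheme the abstract describes.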
Pages: 15