Information Dropping Data Augmentation for Machine Translation Quality Estimation

Cited by: 1
Authors
Li, Shuo [1 ]
Bi, Xiaojun [2 ,3 ]
Liu, Tao [4 ]
Chen, Zheng [2 ,3 ]
Affiliations
[1] Harbin Engn Univ, Coll Informat & Commun Engn, Harbin 150001, Peoples R China
[2] Minzu Univ China, Key Lab Ethn Language Intelligent Anal & Secur Gov, Beijing 100086, Peoples R China
[3] Minzu Univ China, Sch Informat Engn, Beijing 100081, Peoples R China
[4] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Machine translation; Data augmentation; Data models; Computational modeling; Training data; Estimation; Training; information entropy; machine translation; pseudo label; quality estimation;
DOI
10.1109/TASLP.2024.3380996
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Machine translation quality estimation (QE) refers to assessing the quality of machine translations without a reference translation. Supervised QE models based on neural networks have achieved state-of-the-art results, but they require large-scale training data whose high-quality labels must be created by bilingual experts, which is often very costly. We therefore propose a sentence-level machine translation QE data augmentation method based on information dropping. First, we calculate the information of each subword in the target translation using a conditional language model. Next, some subwords in the target translation are randomly deleted or replaced, and a pseudo quality score is obtained by computing the information that remains. Finally, the original and augmented data are combined to train the final model. This pseudo-data generation method based on an information-dropping strategy yields more faithful and diverse training samples without requiring additional corpus resources. Experimental results show that our method improves correlation with human judgment by an average of 5.96% across the seven translation directions of the MLQE-PE dataset, while also improving the model's robustness to low-adequacy samples. In addition, the method requires no modifications to the model architecture.
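The steps in the abstract (score subwords, randomly drop some, derive a pseudo quality score from the remaining information) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: a toy unigram probability table stands in for the conditional language model's subword probabilities, and the scoring function `drop_and_score` and the retained-information ratio used as the pseudo label are assumptions made for illustration.

```python
import math
import random

# Toy stand-in for conditional language model probabilities of subwords.
# A real implementation would score each subword in context with an LM.
TOKEN_PROB = {"the": 0.30, "cat": 0.05, "sat": 0.04, "on": 0.20, "mat": 0.03}

def token_information(token):
    """Self-information of a subword: -log2 p(token)."""
    return -math.log2(TOKEN_PROB.get(token, 0.01))

def drop_and_score(tokens, drop_rate=0.3, rng=None):
    """Randomly delete subwords; return (augmented tokens, pseudo score).

    The pseudo quality score is the fraction of the sentence's total
    information retained after dropping, so deleting more informative
    subwords yields a lower score.
    """
    rng = rng or random.Random(0)
    total = sum(token_information(t) for t in tokens)
    kept = [t for t in tokens if rng.random() > drop_rate]
    remaining = sum(token_information(t) for t in kept)
    return kept, remaining / total if total else 0.0

kept, score = drop_and_score(["the", "cat", "sat", "on", "mat"])
print(kept, round(score, 3))
```

Each augmented sentence paired with its pseudo score can then be mixed with the gold-labeled data to train the QE model, as the abstract describes.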
Pages: 2112-2124
Page count: 13
Related Papers
50 records in total
  • [21] Machine Translation Based Data Augmentation for Cantonese Keyword Spotting
    Huang, Guangpu
    Gorin, Arseniy
    Gauvain, Jean-Luc
    Lamel, Lori
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 6020 - 6024
  • [22] Robust Data Augmentation for Neural Machine Translation through EVALNET
    Park, Yo-Han
    Choi, Yong-Seok
    Yun, Seung
    Kim, Sang-Hun
    Lee, Kong-Joo
    MATHEMATICS, 2023, 11 (01)
  • [23] Syntax-Aware Data Augmentation for Neural Machine Translation
    Duan, Sufeng
    Zhao, Hai
    Zhang, Dongdong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 2988 - 2999
  • [24] Predictor-Estimator: Neural Quality Estimation Based on Target Word Prediction for Machine Translation
    Kim, Hyun
    Jung, Hun-Young
    Kwon, Hongseok
    Lee, Jong-Hyeok
    Na, Seung-Hoon
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2017, 17 (01)
  • [25] Unsupervised Machine Translation Quality Estimation in Black-Box Setting
    Huang, Hui
    Di, Hui
    Xu, Jin'an
    Ouchi, Kazushige
    Chen, Yufeng
    MACHINE TRANSLATION, CCMT 2020, 2020, 1328 : 24 - 36
  • [26] An efficient and user-friendly tool for machine translation quality estimation
    Shah, Kashif
    Turchi, Marco
    Specia, Lucia
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 3560 - 3564
  • [27] A Scenario-Generic Neural Machine Translation Data Augmentation Method
    Liu, Xiner
    He, Jianshu
    Liu, Mingzhe
    Yin, Zhengtong
    Yin, Lirong
    Zheng, Wenfeng
    ELECTRONICS, 2023, 12 (10)
  • [28] Random Concatenation: A Simple Data Augmentation Method for Neural Machine Translation
    Xiao, Nini
    Zhang, Huaao
    Jin, Chang
    Duan, Xiangyu
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2022, PT I, 2022, 13551 : 69 - 80
  • [29] Estimating word-level quality of statistical machine translation output using monolingual information alone
    Tezcan, Arda
    Hoste, Veronique
    Macken, Lieve
    NATURAL LANGUAGE ENGINEERING, 2020, 26 (01) : 73 - 94
  • [30] Efficient Data Augmentation via lexical matching for boosting performance on Statistical Machine Translation for Indic and a Low-resource language
    Saxena, Shefali
    Gupta, Ayush
    Daniel, Philemon
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (24) : 64255 - 64269