Information Dropping Data Augmentation for Machine Translation Quality Estimation

Cited by: 1
Authors
Li, Shuo [1 ]
Bi, Xiaojun [2 ,3 ]
Liu, Tao [4 ]
Chen, Zheng [2 ,3 ]
Affiliations
[1] Harbin Engn Univ, Coll Informat & Commun Engn, Harbin 150001, Peoples R China
[2] Minzu Univ China, Key Lab Ethn Language Intelligent Anal & Secur Gov, Beijing 100086, Peoples R China
[3] Minzu Univ China, Sch Informat Engn, Beijing 100081, Peoples R China
[4] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Machine translation; Data augmentation; Data models; Computational modeling; Training data; Estimation; Training; information entropy; machine translation; pseudo label; quality estimation;
DOI
10.1109/TASLP.2024.3380996
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Machine translation quality estimation (QE) refers to assessing the quality of machine translations without a reference translation. Supervised QE models based on neural networks have achieved state-of-the-art results, but they require large-scale training data whose high-quality labels must be created by bilingual experts, which is often very costly. We therefore propose a sentence-level machine translation QE data augmentation method based on information dropping. First, we calculate the information of each subword in the target translation using a conditional language model. Next, some subwords in the target translation are randomly deleted or replaced, and a pseudo quality score is obtained by computing the information that remains. Finally, the original and augmented data are combined to train the final model. This pseudo-data generation method based on an information-dropping strategy yields more faithful and diverse training samples without requiring additional corpus resources. Experimental results show that our method improves correlation with human judgment by an average of 5.96% across the seven translation directions of the MLQE-PE dataset, while also improving the model's robustness to low-adequacy samples. In addition, the method requires no modifications to the model architecture.
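The steps in the abstract (score subwords, randomly drop some, derive a pseudo quality score from the remaining information) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: a toy unigram probability table stands in for the conditional language model's subword probabilities, and the scoring function `drop_and_score` and the retained-information ratio used as the pseudo label are assumptions made for illustration.

```python
import math
import random

# Toy stand-in for conditional language model probabilities of subwords.
# A real implementation would score each subword in context with an LM.
TOKEN_PROB = {"the": 0.30, "cat": 0.05, "sat": 0.04, "on": 0.20, "mat": 0.03}

def token_information(token):
    """Self-information of a subword: -log2 p(token)."""
    return -math.log2(TOKEN_PROB.get(token, 0.01))

def drop_and_score(tokens, drop_rate=0.3, rng=None):
    """Randomly delete subwords; return (augmented tokens, pseudo score).

    The pseudo quality score is the fraction of the sentence's total
    information retained after dropping, so deleting more informative
    subwords yields a lower score.
    """
    rng = rng or random.Random(0)
    total = sum(token_information(t) for t in tokens)
    kept = [t for t in tokens if rng.random() > drop_rate]
    remaining = sum(token_information(t) for t in kept)
    return kept, remaining / total if total else 0.0

kept, score = drop_and_score(["the", "cat", "sat", "on", "mat"])
print(kept, round(score, 3))
```

Each augmented sentence paired with its pseudo score can then be mixed with the gold-labeled data to train the QE model, as the abstract describes.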
Pages: 2112-2124
Page count: 13
Related Papers
50 records in total
  • [21] Machine Translation Based Data Augmentation for Cantonese Keyword Spotting
    Huang, Guangpu
    Gorin, Arseniy
    Gauvain, Jean-Luc
    Lamel, Lori
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 6020 - 6024
  • [22] Robust Data Augmentation for Neural Machine Translation through EVALNET
    Park, Yo-Han
    Choi, Yong-Seok
    Yun, Seung
    Kim, Sang-Hun
    Lee, Kong-Joo
    MATHEMATICS, 2023, 11 (01)
  • [23] Syntax-Aware Data Augmentation for Neural Machine Translation
    Duan, Sufeng
    Zhao, Hai
    Zhang, Dongdong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 2988 - 2999
  • [24] Predictor-Estimator: Neural Quality Estimation Based on Target Word Prediction for Machine Translation
    Kim, Hyun
    Jung, Hun-Young
    Kwon, Hongseok
    Lee, Jong-Hyeok
    Na, Seung-Hoon
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2017, 17 (01)
  • [25] Unsupervised Machine Translation Quality Estimation in Black-Box Setting
    Huang, Hui
    Di, Hui
    Xu, Jin'an
    Ouchi, Kazushige
    Chen, Yufeng
    MACHINE TRANSLATION, CCMT 2020, 2020, 1328 : 24 - 36
  • [26] An efficient and user-friendly tool for machine translation quality estimation
    Shah, Kashif
    Turchi, Marco
    Specia, Lucia
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 3560 - 3564
  • [27] A Scenario-Generic Neural Machine Translation Data Augmentation Method
    Liu, Xiner
    He, Jianshu
    Liu, Mingzhe
    Yin, Zhengtong
    Yin, Lirong
    Zheng, Wenfeng
    ELECTRONICS, 2023, 12 (10)
  • [28] Random Concatenation: A Simple Data Augmentation Method for Neural Machine Translation
    Xiao, Nini
    Zhang, Huaao
    Jin, Chang
    Duan, Xiangyu
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2022, PT I, 2022, 13551 : 69 - 80
  • [29] Estimating word-level quality of statistical machine translation output using monolingual information alone
    Tezcan, Arda
    Hoste, Veronique
    Macken, Lieve
    NATURAL LANGUAGE ENGINEERING, 2020, 26 (01) : 73 - 94
  • [30] Efficient Data Augmentation via lexical matching for boosting performance on Statistical Machine Translation for Indic and a Low-resource language
    Saxena, Shefali
    Gupta, Ayush
    Daniel, Philemon
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (24) : 64255 - 64269