Information Dropping Data Augmentation for Machine Translation Quality Estimation

Cited by: 1
Authors
Li, Shuo [1 ]
Bi, Xiaojun [2 ,3 ]
Liu, Tao [4 ]
Chen, Zheng [2 ,3 ]
Affiliations
[1] Harbin Engn Univ, Coll Informat & Commun Engn, Harbin 150001, Peoples R China
[2] Minzu Univ China, Key Lab Ethn Language Intelligent Anal & Secur Gov, Beijing 100086, Peoples R China
[3] Minzu Univ China, Sch Informat Engn, Beijing 100081, Peoples R China
[4] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Machine translation; Data augmentation; Data models; Computational modeling; Training data; Estimation; Training; information entropy; machine translation; pseudo label; quality estimation;
DOI
10.1109/TASLP.2024.3380996
CLC Number
O42 [Acoustics];
Subject Classification Code
070206 ; 082403 ;
Abstract
Machine translation quality estimation (QE) refers to assessing the quality of machine translations without a reference translation. Supervised QE models based on neural networks have achieved state-of-the-art results, but they require large-scale training data whose high-quality labels must be produced by bilingual experts, which is often very costly. We therefore propose a sentence-level machine translation QE data augmentation method based on information dropping. First, we calculate the information content of the subwords in the target translation using a conditional language model. Then, some subwords in the target translation are randomly deleted or replaced, and a pseudo quality score is obtained by calculating the remaining information. Finally, the original and augmented data are combined to train the final model. This pseudo-data generation strategy based on information dropping yields more faithful and diverse training samples without requiring additional corpus resources. Experimental results show that the method improves the correlation with human judgment by an average of 5.96% across the seven translation directions of the MLQE-PE dataset, while also improving the model's robustness to low-adequacy samples. In addition, the method requires no modification to the model architecture.
Pages: 2112 - 2124
Number of pages: 13
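To make the procedure described in the abstract concrete, the following is a minimal sketch of the information-dropping idea: per-subword information is taken from a conditional language model, some subwords are randomly deleted or replaced, and a pseudo quality score is derived from the information that remains. The placeholder function `token_log_probs`, the drop/replace ratios, and the exact score formula are illustrative assumptions, not the authors' implementation.

```python
# Sketch of information-dropping pseudo-data generation for sentence-level QE.
# All names and numeric choices below are illustrative assumptions.
import math
import random


def token_log_probs(source_tokens, target_tokens):
    """Placeholder for a conditional language model: per-subword log-probabilities
    of the target given the source. Replace with a real cross-lingual LM."""
    # Hypothetical uniform probabilities, for demonstration only.
    return [math.log(0.5) for _ in target_tokens]


def information_dropping(source_tokens, target_tokens, drop_ratio=0.2, replace_token="<unk>"):
    """Randomly delete or replace subwords in the target and compute a pseudo
    quality score from the fraction of information content that is retained."""
    log_probs = token_log_probs(source_tokens, target_tokens)
    info = [-lp for lp in log_probs]          # per-subword information content
    total_info = sum(info)

    augmented, kept_info = [], 0.0
    for tok, i in zip(target_tokens, info):
        if random.random() < drop_ratio:
            # Corrupt this subword: delete it or replace it with a noise token.
            if random.random() < 0.5:
                continue                      # deletion: the subword is dropped
            augmented.append(replace_token)   # replacement: position kept, information lost
        else:
            augmented.append(tok)
            kept_info += i                    # information preserved in the augmented sentence

    pseudo_score = kept_info / total_info if total_info > 0 else 0.0
    return augmented, pseudo_score


if __name__ == "__main__":
    src = "das ist ein Test".split()
    tgt = "this is a test".split()
    aug, score = information_dropping(src, tgt)
    print(aug, round(score, 3))
```

The augmented sentence and its pseudo score would then be added to the original labeled data when training the QE model, as the abstract describes; no change to the model architecture is needed.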