Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation

Cited by: 1
Authors
Sanchez-Cartagena, Victor M. [1]
Espla-Gomis, Miquel [1]
Perez-Ortiz, Juan Antonio [1]
Sanchez-Martinez, Felipe [1]
Affiliation
[1] Univ Alacant, Dept Llenguatges & Sistemes Informat, Valencia 03690, Spain
Keywords
Data augmentation; low-resource languages; machine translation; multi-task learning
DOI
10.1109/TPAMI.2023.3333949
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
When the number of parallel sentences available to train a neural machine translation system is small, a common practice is to generate new synthetic training samples from them. A number of approaches have been proposed to produce synthetic parallel sentences that are similar to those in the available parallel data. These approaches work under the assumption that non-fluent target-side synthetic training samples can be harmful and may degrade translation performance. Even so, in this paper we demonstrate that synthetic training samples with non-fluent target sentences can improve translation performance if they are used in a multilingual machine translation framework as if they were sentences in another language. We conducted experiments on ten low-resource and four high-resource translation tasks and found that this simple approach consistently improves translation performance compared to state-of-the-art methods for generating synthetic training samples similar to those found in corpora. Furthermore, this improvement is independent of the size of the original training corpus, and the resulting systems are much more robust against domain shift and produce fewer hallucinations.
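The core idea in the abstract, treating non-fluent synthetic targets as if they belonged to another language in a multilingual setup, can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and is not taken from the abstract: the token-reversal transformation, the "<2tgt>" and "<2synth>" source-side tags, and the helper names reverse_tokens, tag_pair, and build_training_set are all hypothetical. The abstract does not state which transformations or tagging scheme the authors use.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def reverse_tokens(sentence: str) -> str:
    """Hypothetical transformation: reverse whitespace-separated tokens.

    Used only to produce a deliberately non-fluent target sentence.
    """
    return " ".join(reversed(sentence.split()))


def tag_pair(source: str, target: str, tag: str) -> Pair:
    """Prefix the source with a target-language tag, as is commonly done
    in multilingual NMT. The tag format is an assumption."""
    return (f"<2{tag}> {source}", target)


def build_training_set(
    pairs: Iterable[Pair],
    transform: Callable[[str], str] = reverse_tokens,
    real_tag: str = "tgt",
    synth_tag: str = "synth",
) -> List[Pair]:
    """Emit each original pair plus one synthetic pair whose non-fluent
    target is marked as a separate pseudo-language."""
    tagged: List[Pair] = []
    for source, target in pairs:
        tagged.append(tag_pair(source, target, real_tag))
        tagged.append(tag_pair(source, transform(target), synth_tag))
    return tagged


corpus = [("ein kleines Haus", "a small house")]
training_set = build_training_set(corpus)
for src, tgt in training_set:
    print(src, "=>", tgt)
```

Under these assumptions, a multilingual model trained on the combined set would only be asked for the real target language at inference time, so the non-fluent targets act as an auxiliary task rather than as examples of desired output.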
Pages: 837-850
Page count: 14