Evaluating Pre-training Objectives for Low-Resource Translation into Morphologically Rich Languages

Cited: 0
Authors: Dhar, Prajit [1]; Bisazza, Arianna [1]; van Noord, Gertjan [1]
Affiliations: [1] Univ Groningen, Ctr Language & Cognit Groningen (CLCG), Groningen, Netherlands
Keywords: low-resource NMT; morphology; inflection
DOI: not available
CLC classification: TP39 [computer applications]
Subject classification: 081203; 0835
Abstract
The scarcity of parallel data is a major limitation for Neural Machine Translation (NMT) systems, in particular for translation into morphologically rich languages (MRLs). An important way to overcome the lack of parallel data is to leverage target monolingual data, which is typically more abundant and easier to collect. We evaluate a number of techniques to achieve this, ranging from back-translation to random token masking, on the challenging task of translating English into four typologically diverse MRLs under low-resource settings. Additionally, we introduce Inflection Pre-Training (PT-Inflect), a novel pre-training objective whereby the NMT system is pre-trained on the task of re-inflecting lemmatized target sentences before being trained on standard source-to-target translation. We find that PT-Inflect surpasses NMT systems trained only on parallel data. While PT-Inflect is outperformed by back-translation overall, combining the two techniques leads to gains in some of the evaluated language pairs.
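The PT-Inflect objective described above turns target monolingual data into pre-training pairs: the model learns to map a lemmatized sentence back to its fully inflected original. A minimal sketch of how such pairs could be constructed, using a toy lemma table as a stand-in for a real morphological analyzer (the table and function names are illustrative, not from the paper):

```python
# Toy lemma table; a real pipeline would use a morphological analyzer
# or lemmatizer for the target MRL.
TOY_LEMMAS = {
    "houses": "house",
    "built": "build",
    "was": "be",
}

def lemmatize(sentence: str) -> str:
    """Replace each token with its lemma (identity for unknown tokens)."""
    return " ".join(TOY_LEMMAS.get(tok, tok) for tok in sentence.split())

def make_inflection_pairs(target_sentences):
    """Build (lemmatized, original) pairs: the pre-training 'source'
    is the lemmatized sentence, the 'target' is the inflected original."""
    return [(lemmatize(s), s) for s in target_sentences]

pairs = make_inflection_pairs(["the houses was built"])
# source: "the house be build"  ->  target: "the houses was built"
```

After pre-training on such pairs, the same encoder-decoder would be fine-tuned on the actual English-to-MRL parallel data.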
Pages: 4933-4943 (11 pages)