Evaluating Pre-training Objectives for Low-Resource Translation into Morphologically Rich Languages

Cited: 0
Authors
Dhar, Prajit [1]
Bisazza, Arianna [1]
van Noord, Gertjan [1]
Affiliations
[1] Univ Groningen, Ctr Language & Cognit Groningen CLCG, Groningen, Netherlands
Keywords
low resource nmt; morphology; inflection
DOI
Not available
CLC number
TP39 [Computer Applications]
Discipline codes
081203; 0835
Abstract
The scarcity of parallel data is a major limitation for Neural Machine Translation (NMT) systems, in particular for translation into morphologically rich languages (MRLs). An important way to overcome the lack of parallel data is to leverage target monolingual data, which is typically more abundant and easier to collect. We evaluate a number of techniques for doing so, ranging from back-translation to random token masking, on the challenging task of translating English into four typologically diverse MRLs under low-resource settings. Additionally, we introduce Inflection Pre-Training (PT-Inflect), a novel pre-training objective whereby the NMT system is first pre-trained on re-inflecting lemmatized target sentences and then trained on standard source-to-target translation. We find that PT-Inflect surpasses NMT systems trained only on parallel data. While PT-Inflect is outperformed by back-translation overall, combining the two techniques leads to gains in some of the evaluated language pairs.
Pages: 4933-4943 (11 pages)
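The abstract describes PT-Inflect as pre-training on the task of re-inflecting lemmatized target sentences. A minimal sketch of how such pre-training pairs might be constructed, assuming a lemmatizer for the target language is available (the toy lemma table and function names below are illustrative, not taken from the paper):

```python
# Illustrative construction of PT-Inflect pre-training data: the pre-training
# "source" is a lemmatized target-language sentence and the "target" is the
# original inflected sentence, so the model learns to re-inflect.
# TOY_LEMMAS is a hypothetical stand-in for a real morphological lemmatizer.
TOY_LEMMAS = {
    "houses": "house",
    "were": "be",
    "built": "build",
}

def lemmatize(sentence: str) -> str:
    """Replace each token with its lemma where known (toy approximation)."""
    return " ".join(TOY_LEMMAS.get(tok, tok) for tok in sentence.split())

def make_pt_inflect_pairs(monolingual_sentences):
    """Yield (lemmatized, inflected) pairs for re-inflection pre-training."""
    for sent in monolingual_sentences:
        yield lemmatize(sent), sent

pairs = list(make_pt_inflect_pairs(["the houses were built quickly"]))
# pairs[0] -> ("the house be build quickly", "the houses were built quickly")
```

In a real pipeline the pairs would be drawn from abundant target-side monolingual data, which is exactly the resource the abstract identifies as easier to collect than parallel data.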