Machine Reading Comprehension Model for Low-Resource Languages and Experimenting on Vietnamese

Cited by: 1
Authors
Bach Hoang Tien Nguyen [1 ]
Dung Manh Nguyen [1 ]
Trang Thi Thu Nguyen [1 ]
Affiliations
[1] Hanoi Univ Sci & Technol, Sch Informat & Commun Technol, Hanoi, Vietnam
Keywords
Low resource languages; Translated datasets; Pre-train layer;
DOI
10.1007/978-3-031-08530-7_31
CLC Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Machine Reading Comprehension (MRC) is a challenging task in natural language processing. Many large datasets and strong models have recently been released for this task, but most of them cover English only. Because building a good MRC dataset takes considerable effort, this paper proposes a method, called UtlTran, to improve MRC quality for low-resource languages. In this method, all available English MRC datasets are collected and translated into the target language, using several context-reducing strategies for better results. Question and context tokens are initialized with word representations from a word embedding model. The MRC model is then pre-trained on the translated dataset for the specific low-resource language. Finally, a small manually built MRC dataset is used to fine-tune the model for the best results. Experimental results on Vietnamese show that the best word embedding model for this task is a multilingual one, XLM-R, while the best translation strategy is to reduce the context by answer positions. The proposed model achieves the best quality on the UIT-ViQuAD dataset, i.e. F1 = 88.2% and Exact Match (EM) = 71.8%, compared to state-of-the-art models.
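The abstract's "reduce context by answer positions" strategy can be sketched roughly as follows. This is a minimal illustration under assumptions: the function name, the naive period-based sentence splitter, and the one-sentence window are hypothetical choices for the sketch, not the authors' exact procedure.

```python
# Hypothetical sketch of a "reduce context by answer position" strategy:
# keep only the sentence containing the answer span plus a small window
# of neighbouring sentences, and remap the answer's start offset.

def reduce_context_by_answer(context: str, answer_start: int,
                             window: int = 1) -> tuple[str, int]:
    # Naive sentence segmentation on '.', recording (start, end) offsets.
    sentences, begin = [], 0
    for i, ch in enumerate(context):
        if ch == '.':
            sentences.append((begin, i + 1))
            begin = i + 1
    if begin < len(context):
        sentences.append((begin, len(context)))

    # Index of the sentence that contains the answer's start offset.
    hit = next(idx for idx, (s, e) in enumerate(sentences)
               if s <= answer_start < e)

    # Keep `window` sentences on each side of the answer sentence.
    lo = max(0, hit - window)
    hi = min(len(sentences), hit + window + 1)
    new_begin = sentences[lo][0]
    reduced = context[new_begin:sentences[hi - 1][1]]
    return reduced, answer_start - new_begin
```

In a SQuAD-style record, `answer_start` is the character offset stored with each answer, so remapping it keeps the (context, answer) pair consistent after the context is shortened.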
Pages: 370 - 381
Page count: 12
Related Papers
50 total
  • [31] GlotLID: Language Identification for Low-Resource Languages
    Kargaran, Amir Hossein
    Imani, Ayyoob
    Yvon, Francois
    Schuetze, Hinrich
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 6155 - 6218
  • [32] Discourse annotation guideline for low-resource languages
    Vargas, Francielle
    Schmeisser-Nieto, Wolfgang
    Rabinovich, Zohar
    Pardo, Thiago A. S.
    Benevenuto, Fabricio
    NATURAL LANGUAGE PROCESSING, 2025, 31 (02): : 700 - 743
  • [33] Extending Multilingual BERT to Low-Resource Languages
    Wang, Zihan
    Karthikeyan, K.
    Mayhew, Stephen
    Roth, Dan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 2649 - 2656
  • [34] Attention is all low-resource languages need
    Poupard, Duncan
    TRANSLATION STUDIES, 2024, 17 (02) : 424 - 427
  • [35] Sentence Extraction-Based Machine Reading Comprehension for Vietnamese
    Phong Nguyen-Thuan Do
    Nhat Duy Nguyen
    Tin Van Huynh
    Kiet Van Nguyen
    Anh Gia-Tuan Nguyen
    Ngan Luu-Thuy Nguyen
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, KSEM 2021, PT II, 2021, 12816 : 511 - 523
  • [36] Morpheme-Based Neural Machine Translation Models for Low-Resource Fusion Languages
    Gezmu, Andargachew Mekonnen
    Nürnberger, Andreas
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (09)
  • [37] Neural machine translation of low-resource languages using SMT phrase pair injection
    Sen, Sukanta
    Hasanuzzaman, Mohammed
    Ekbal, Asif
    Bhattacharyya, Pushpak
    Way, Andy
    NATURAL LANGUAGE ENGINEERING, 2021, 27 (03) : 271 - 292
  • [38] Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation
    Elbayad, Maha
    Sun, Anna
    Bhosale, Shruti
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 14237 - 14253
  • [39] Neighbors helping the poor: improving low-resource machine translation using related languages
    Pourdamghani, Nima
    Knight, Kevin
    MACHINE TRANSLATION, 2019, 33 (03) : 239 - 258
  • [40] Language Model Prior for Low-Resource Neural Machine Translation
    Baziotis, Christos
    Haddow, Barry
    Birch, Alexandra
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 7622 - 7634