Machine Reading Comprehension Model for Low-Resource Languages and Experimenting on Vietnamese

Cited by: 1
Authors
Bach Hoang Tien Nguyen [1 ]
Dung Manh Nguyen [1 ]
Trang Thi Thu Nguyen [1 ]
Affiliations
[1] Hanoi Univ Sci & Technol, Sch Informat & Commun Technol, Hanoi, Vietnam
Keywords
Low resource languages; Translated datasets; Pre-train layer
DOI
10.1007/978-3-031-08530-7_31
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Machine Reading Comprehension (MRC) is a challenging task in natural language processing. Recently, many large datasets and strong models have been released for this task, but most of them cover English only. Because building a good MRC dataset takes considerable effort, this paper proposes a method, called UtlTran, to improve MRC quality for low-resource languages. In this method, all available English MRC datasets are collected and translated into the target language, using several context-reducing strategies for better results. Question and context tokens are initialized with word representations from a word embedding model, and the MRC model is then pre-trained on the translated dataset for the specific low-resource language. Finally, a small manually built MRC dataset is used to fine-tune the model for the best results. Experimental results on Vietnamese show that the best word embedding model for this task is a multilingual one, XLM-R, whereas the best translation strategy is to reduce the context by answer position. The proposed model achieves the best quality, i.e. F1 = 88.2% and Exact Match (EM) = 71.8%, on the UIT-ViQuAD dataset, compared to state-of-the-art models.
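The "reduce context by answer positions" strategy mentioned in the abstract can be pictured as trimming each passage to a window around the answer span before machine translation, since shorter contexts tend to translate more faithfully. The sketch below is illustrative only; the function name, window size, and word-boundary snapping are assumptions, not the paper's actual implementation.

```python
def reduce_context_by_answer(context, answer_start, answer_text, window=100):
    """Keep only a character window around the answer span.

    Returns the reduced context and the answer's new start offset
    within it. Window edges are snapped outward to whitespace so
    no word is cut in half.
    """
    start = max(0, answer_start - window)
    end = min(len(context), answer_start + len(answer_text) + window)
    # Snap the left edge back to the start of a word.
    while start > 0 and not context[start - 1].isspace():
        start -= 1
    # Snap the right edge forward to the end of a word.
    while end < len(context) and not context[end].isspace():
        end += 1
    return context[start:end], answer_start - start


# Usage: trim a SQuAD-style (context, answer) pair before translation.
context = ("The Eiffel Tower is located in Paris, France, "
           "and was completed in 1889.")
answer = "Paris"
reduced, offset = reduce_context_by_answer(
    context, context.index(answer), answer, window=10)
# The answer span survives the reduction at the new offset.
```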
Pages: 370-381
Page count: 12
Related Papers
50 records in total
  • [1] Effective Strategies for Low-Resource Reading Comprehension
    Jing, Yimin
    Xiong, Deyi
    2020 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2020), 2020, : 153 - 157
  • [2] Curriculum Learning Driven Domain Adaptation for Low-Resource Machine Reading Comprehension
    Zhang, Licheng
    Wang, Quan
    Xu, Benfeng
    Liu, Yi
    Mao, Zhendong
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 2650 - 2654
  • [3] A Query-Parallel Machine Reading Comprehension Framework for Low-resource NER
    Zhang, Yuhao
    Wang, Yongliang
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 2052 - 2065
  • [4] Domain adaptive multi-task transformer for low-resource machine reading comprehension
    Bai, Ziwei
    Wang, Baoxun
    Wang, Zongsheng
    Yuan, Caixia
    Wang, Xiaojie
    NEUROCOMPUTING, 2022, 509 : 46 - 55
  • [5] OCR Improves Machine Translation for Low-Resource Languages
    Ignat, Oana
    Maillard, Jean
    Chaudhary, Vishrav
    Guzman, Francisco
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 1164 - 1174
  • [6] Neural Machine Translation for Low-resource Languages: A Survey
    Ranathunga, Surangika
    Lee, En-Shiun Annie
    Skenduli, Marjana Prifti
    Shekhar, Ravi
    Alam, Mehreen
    Kaur, Rishemjit
    ACM COMPUTING SURVEYS, 2023, 55 (11)
  • [7] Introduction to the second issue on machine translation for low-resource languages
    Liu, Chao-Hong
    Karakanta, Alina
    Tong, Audrey N.
    Aulov, Oleg
    Soboroff, Ian M.
    Washington, Jonathan
    Zhao, Xiaobing
    MACHINE TRANSLATION, 2021, 35 (01) : 1 - 2
  • [8] Machine Translation in Low-Resource Languages by an Adversarial Neural Network
    Sun, Mengtao
    Wang, Hao
    Pasquine, Mark
    Hameed, Ibrahim A.
    APPLIED SCIENCES-BASEL, 2021, 11 (22):
  • [9] Introduction to the Special Issue on Machine Translation for Low-Resource Languages
    Liu, Chao-Hong
    Karakanta, Alina
    Tong, Audrey N.
    Aulov, Oleg
    Soboroff, Ian M.
    Washington, Jonathan
    Zhao, Xiaobing
    MACHINE TRANSLATION, 2020, 34 (04) : 247 - 249
  • [10] Neural Machine Translation of Low-Resource and Similar Languages with Backtranslation
    Przystupa, Michael
    Abdul-Mageed, Muhammad
    FOURTH CONFERENCE ON MACHINE TRANSLATION (WMT 2019), VOL 3: SHARED TASK PAPERS, DAY 2, 2019, : 224 - 235