Machine Reading Comprehension Model for Low-Resource Languages and Experimenting on Vietnamese

Cited by: 1
Authors
Bach Hoang Tien Nguyen [1]
Dung Manh Nguyen [1]
Trang Thi Thu Nguyen [1]
Affiliations
[1] Hanoi Univ Sci & Technol, Sch Informat & Commun Technol, Hanoi, Vietnam
Keywords
Low resource languages; Translated datasets; Pre-train layer
DOI
10.1007/978-3-031-08530-7_31
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Machine Reading Comprehension (MRC) is a challenging task in natural language processing. Many large datasets and strong models have recently been made public for this task, but most of them cover English only. Because building a good MRC dataset takes considerable effort, this paper proposes a method, called UtlTran, to improve MRC quality for low-resource languages. In this method, all available English MRC datasets are collected and translated into the target language, with context-reducing strategies applied for better results. Word representations for the question and context tokens are initialized using a word embedding model. They are then pre-trained together with the MRC model on the translated dataset for the specific low-resource language. Finally, a small manually built MRC dataset is used to fine-tune the model further and obtain the best results. Experimental results on Vietnamese show that the best word embedding model for this task is a multilingual one, XLM-R, and that the best translation strategy is to reduce the context by answer positions. The proposed model achieves the best quality on the UIT-ViQuAD dataset, with F1 = 88.2% and Exact Match (EM) = 71.8%, compared to the state-of-the-art models.
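The abstract does not spell out how the context is reduced by answer positions. The following is only a minimal sketch of one plausible reading, assuming SQuAD-style examples in which the answer is given as a character offset into the context. The function name `reduce_context_by_answer`, the `window` parameter, and the naive sentence splitter are hypothetical and are not taken from the paper.

```python
import re


def reduce_context_by_answer(context, answer_start, answer_text, window=1):
    """Keep the sentence(s) holding the answer plus `window` neighbours per side.

    Hypothetical helper, not the authors' implementation.
    Returns (reduced_context, new_answer_start).
    """
    # Naive sentence spans: split on whitespace that follows '.', '!' or '?'.
    spans = []
    start = 0
    for match in re.finditer(r"(?<=[.!?])\s+", context):
        spans.append((start, match.start()))
        start = match.end()
    if start < len(context):
        spans.append((start, len(context)))

    answer_end = answer_start + len(answer_text)
    # Indices of the sentences that overlap the answer span.
    hits = [i for i, (s, e) in enumerate(spans) if s < answer_end and e > answer_start]
    if not hits:
        # Answer offset does not fall inside any sentence: leave the example as is.
        return context, answer_start

    first = max(0, hits[0] - window)
    last = min(len(spans) - 1, hits[-1] + window)
    cut_start, cut_end = spans[first][0], spans[last][1]
    # Shift the answer offset so it still points at the answer in the shorter text.
    return context[cut_start:cut_end], answer_start - cut_start


context = (
    "Hanoi is the capital of Vietnam. It lies on the Red River. "
    "The city has many lakes. Pho is a popular dish. Tourists visit all year."
)
answer = "Red River"
reduced, new_start = reduce_context_by_answer(
    context, context.index(answer), answer, window=0
)
```

Shortening the context before translation keeps less text for the translation system to distort, and recomputing the offset keeps the answer span aligned in the reduced passage. After translation the answer would still have to be located again in the target-language text, which this sketch does not attempt.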
Pages: 370-381
Page count: 12