KenSwQuAD-A Question Answering Dataset for Swahili Low-resource Language

Cited by: 2
Authors
Wanjawa, Barack W. [1]
Wanzare, Lilian D. A. [2]
Indede, Florence [2]
Mconyango, Owen [2]
Muchemi, Lawrence [1]
Ombui, Edward [3]
Affiliations
[1] Univ Nairobi, POB 30197, Nairobi 00100, Kenya
[2] Maseno Univ, POB 333, Maseno, Kenya
[3] Africa Nazarene Univ, POB 53067, Nairobi 00200, Kenya
Keywords
Swahili; question answer; low-resource languages
DOI
10.1145/3578553
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The need for question-answering (QA) datasets in low-resource languages motivates this research, which led to the development of the Kencorpus Swahili Question Answering Dataset (KenSwQuAD). This dataset is annotated from raw story texts of Swahili, a low-resource language that is predominantly spoken in eastern Africa and in other parts of the world. Question-answering datasets are important for machine comprehension of natural language for tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold-standard question-answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 of the 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to the QA task confirmed that the dataset is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language.
Pages: 20