KenSwQuAD-A Question Answering Dataset for Swahili Low-resource Language

Cited by: 2
Authors
Wanjawa, Barack W. [1]
Wanzare, Lilian D. A. [2]
Indede, Florence [2]
Mconyango, Owen [2]
Muchemi, Lawrence [1]
Ombui, Edward [3]
Affiliations
[1] Univ Nairobi, POB 30197, Nairobi 00100, Kenya
[2] Maseno Univ, POB 333, Maseno, Kenya
[3] Africa Nazarene Univ, POB 53067, Nairobi 00200, Kenya
Keywords
Swahili; question answer; low-resource languages
DOI
10.1145/3578553
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This research is motivated by the need for question-answering (QA) datasets in low-resource languages, and it led to the development of the Kencorpus Swahili Question Answering Dataset (KenSwQuAD). The dataset is annotated from raw story texts in Swahili, a low-resource language spoken predominantly in eastern Africa and also in other parts of the world. QA datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold-standard QA set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 of the 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set comprising 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept applying the dataset to the QA task confirmed that it is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language.
Pages: 20