KenSwQuAD - A Question Answering Dataset for Swahili Low-resource Language

Cited by: 2

Authors:

Wanjawa, Barack W. [1]

Wanzare, Lilian D. A. [2]

Indede, Florence [2]

Mconyango, Owen [2]

Muchemi, Lawrence [1]

Ombui, Edward [3]

Affiliations:

[1] Univ Nairobi, POB 30197, Nairobi 00100, Kenya

[2] Maseno Univ, POB 333, Maseno, Kenya

[3] Africa Nazarene Univ, POB 53067, Nairobi 00200, Kenya

Source:

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING | 2023, Vol. 22, No. 04

Keywords:

Swahili; question answer; low-resource languages

DOI:

10.1145/3578553

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The need for question-answering (QA) datasets in low-resource languages is the motivation of this research, leading to the development of the Kencorpus Swahili Question Answering Dataset (KenSwQuAD). This dataset is annotated from raw story texts of Swahili, a low-resource language that is predominantly spoken in eastern Africa and in other parts of the world. Question-answering datasets are important for machine comprehension of natural language for tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold-standard question-answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 texts from the total 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept on applying the set to the QA task confirmed that the dataset can be usable for such tasks. KenSwQuAD has also contributed to resourcing of the Swahili language.
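
The figures in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above. The variable names are illustrative, and the derived values (coverage, mean pairs per text, approximate size of the quality-assurance sample) are computed here and are not stated in the record.

```python
# Counts quoted in the abstract
total_texts = 2585          # Swahili texts collected by Kencorpus
annotated_texts = 1445      # texts that received QA annotation
qa_pairs = 7526             # QA pairs in the final dataset
qa_sample_fraction = 0.125  # share of annotated texts used for quality assurance

# Derived values (computed here, not reported in the record)
coverage = annotated_texts / total_texts
pairs_per_text = qa_pairs / annotated_texts
qa_sample_texts = qa_sample_fraction * annotated_texts

print(f"Share of texts annotated: {coverage:.1%}")
print(f"Mean QA pairs per annotated text: {pairs_per_text:.2f}")
print(f"Approximate quality-assurance sample: {qa_sample_texts:.0f} texts")
```

The mean of roughly 5.2 pairs per text is consistent with the stated minimum of 5 QA pairs per annotated text.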

Pages: 20

32 items in total

[1] Model for Semantic Network Generation from Low Resource Languages as Applied to Question Answering - Case of Swahili
Wanjawa, Barack
Muchemi, Lawrence
2021 IST-AFRICA CONFERENCE (IST-AFRICA), 2021
[2] Cross-lingual embedding for cross-lingual question retrieval in low-resource community question answering
HajiAminShirazi, Shahrzad
Momtazi, Saeedeh
MACHINE TRANSLATION, 2020, 34 (04) : 287 - 303
[3] A Deep Learning model for Question Analysis in Low-resource Languages: A Dataset and Case Study for Persian
Khaksefidi, Fatemeh Ebrahimi
Fatemi, Afsaneh
Nematbakhsh, Mohammad Ali
Kia, Mahsa Abazari
2024 14TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION SYSTEMS, ICPRS, 2024
[4] NLPashto: NLP Toolkit for Low-resource Pashto Language
Haq, Ijazul
Qiu, Weidong
Guo, Jie
Tang, Peng
INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2023, 14 (06) : 1344 - 1352
[5] Multilingual Offensive Language Identification for Low-resource Languages
Ranasinghe, Tharindu
Zampieri, Marcos
ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2022, 21 (01)
[6] Language fusion via adapters for low-resource speech recognition
Hu, Qing
Zhang, Yan
Zhang, Xianlei
Han, Zongyu
Liang, Xiuxia
SPEECH COMMUNICATION, 2024, 158
[7] Exploring Bilingual Word Embeddings for Hiligaynon, a Low-Resource Language
Michel, Leah
Hangya, Viktor
Fraser, Alexander
PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 2573 - 2580
[8] Towards Language Service Creation and Customization for Low-Resource Languages
Lin, Donghui
Murakami, Yohei
Ishida, Toru
INFORMATION, 2020, 11 (02)
[9] A General Procedure for Improving Language Models in Low-Resource Speech Recognition
Liu, Qian
Zhang, Wei-Qiang
Liu, Jia
Liu, Yao
PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2019, : 428 - 433
[10] Evaluating Phonemic Transcription of Low-Resource Tonal Languages for Language Documentation
Adams, Oliver
Cohn, Trevor
Neubig, Graham
Cruz, Hilaria
Bird, Steven
Michaud, Alexis
PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 3356 - 3365