Building a benchmark dataset for the Kurdish news question answering

Cited by: 0

Authors:

Saeed, Ari M. [1]

Affiliations:

[1] Univ Halabja, Coll Sci, Comp Sci Dept, Halabja, Kurdistan Region, Iraq

Source:

DATA IN BRIEF | 2024 / Volume 57

Keywords:

Kurdish question answering system; Kurdish news dataset; Data mining; Text pre-processing; Machine learning;

DOI:

10.1016/j.dib.2024.110916

Chinese Library Classification (CLC) number:

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];

Discipline classification code:

07 ; 0710 ; 09 ;

Abstract:

This article presents the Kurdish News Question Answering Dataset (KNQAD). The texts were collected from various Kurdish news websites, using the ParseHub software to extract data from different news categories such as social news, religion, sports, science, and economy. The dataset consists of 15,002 news paragraphs with question-answer pairs. For each news paragraph, one or more question-answer pairs were created manually from the paragraph's content. The dataset is pre-processed by cleaning and normalizing the data: during cleaning, special characters and stop words are removed, and stemming is applied as the normalization step. The distribution of question types in KNQAD is reported. In addition, the difficulty of the QA task in KNQAD is analyzed by measuring lexical similarity between questions and answers. © 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
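The cleaning step and the question-answer lexical similarity analysis mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the similarity measure, the stop-word list, or the stemmer, so everything below is an assumption made for illustration: Jaccard overlap stands in for "lexical similarity", `STOP_WORDS` is a hypothetical English placeholder list, stemming is omitted, and the function names `clean` and `jaccard` are invented here rather than taken from KNQAD.

```python
import re

# Hypothetical placeholder stop-word list (English, for readability only);
# it is NOT the Kurdish stop-word list used to build KNQAD.
STOP_WORDS = {"the", "is", "of", "in", "a"}


def clean(text: str) -> list[str]:
    """Lowercase the text, replace special characters with spaces,
    split on whitespace, and drop stop words. Stemming is omitted."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in STOP_WORDS]


def jaccard(question: str, answer: str) -> float:
    """Jaccard overlap between the cleaned token sets of a question
    and an answer: |Q ∩ A| / |Q ∪ A| (0.0 when both sets are empty)."""
    q, a = set(clean(question)), set(clean(answer))
    if not q and not a:
        return 0.0
    return len(q & a) / len(q | a)


if __name__ == "__main__":
    # Token sets after cleaning: {who, won, match} and {team, won, match}
    print(jaccard("Who won the match?", "The team won the match."))
```

A low overlap score between a question and its answer suggests that simple word matching is not enough to locate the answer, which is one plausible way to read "complexity of the QA problem" in this context.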


Pages: 12
