Building a benchmark dataset for the Kurdish news question answering

Authors
Saeed, Ari M. [1 ]
Affiliations
[1] Univ Halabja, Coll Sci, Comp Sci Dept, Halabja, Kurdistan Region, Iraq
Source
DATA IN BRIEF | 2024 / Volume 57
Keywords
Kurdish question answering system; Kurdish news dataset; Data mining; Text pre-processing; Machine learning;
DOI
10.1016/j.dib.2024.110916
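Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code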
07 ; 0710 ; 09 ;
Abstract
This article presents the Kurdish News Question Answering Dataset (KNQAD). The texts are collected from various Kurdish news websites. The ParseHub software is used to extract data from different news categories, such as social news, religion, sports, science, and economy. The dataset consists of 15,002 news paragraphs with question-answer pairs. For each news paragraph, one or more question-answer pairs are manually created based on the content of the paragraph. The dataset is pre-processed by cleaning and normalizing the data. During cleaning, special characters and stop words are removed, and stemming is applied as a normalization step. The distribution of each question type in the KNQAD is presented. Moreover, the complexity of the QA problem in the KNQAD is analyzed by applying lexical similarity techniques to questions and answers. © 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
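The cleaning step and the question-answer lexical similarity described in the abstract can be illustrated with a minimal Python sketch. It is not the author's pipeline. The abstract does not name the similarity measure, so Jaccard overlap is used here as an assumption. The stop-word list and the English sample sentences are hypothetical placeholders (a real run would use a Kurdish stop-word list), and the stemming step is omitted because it needs a Kurdish-specific stemmer.

```python
import re

# Hypothetical stop-word list for illustration only;
# the actual dataset would require a Kurdish stop-word list.
STOP_WORDS = {"the", "is", "of", "in", "a"}


def clean(text):
    """Lowercase the text, replace special characters with spaces,
    and drop stop words. Returns a list of tokens."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def jaccard(question, answer):
    """Jaccard overlap between the cleaned token sets of a question
    and an answer (an assumed measure, not confirmed by the source)."""
    q, a = set(clean(question)), set(clean(answer))
    if not q and not a:
        return 0.0  # define similarity of two empty texts as 0
    return len(q & a) / len(q | a)


if __name__ == "__main__":
    question = "What is the capital of Iraq?"
    answer = "The capital of Iraq is Baghdad."
    print(clean(question))
    print(jaccard(question, answer))
```

A low score for a pair suggests that the answer shares few surface words with the question, which is one way to read the "complexity" of a QA pair in lexical terms.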
Pages: 12