Building a benchmark dataset for the Kurdish news question answering

Authors
Saeed, Ari M. [1]
Affiliation
[1] Univ Halabja, Coll Sci, Comp Sci Dept, Halabja, Kurdistan Regio, Iraq
Source
DATA IN BRIEF | 2024 / Vol. 57
Keywords
Kurdish question answering system; Kurdish news dataset; Data mining; Text pre-processing; Machine learning
DOI
10.1016/j.dib.2024.110916
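Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract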
This article presents the Kurdish News Question Answering Dataset (KNQAD). The texts were collected from various Kurdish news websites, using the ParseHub software to extract data from different news categories, such as social news, religion, sports, science, and economy. The dataset consists of 15,002 news paragraphs with question-answer pairs. For each news paragraph, one or more question-answer pairs were created manually based on the paragraph's content. The dataset is pre-processed by cleaning and normalizing the data: during cleaning, special characters and stop words are removed, and stemming is applied as a normalization step. The distribution of question types in the KNQAD is presented. Moreover, the complexity of the QA problem in the KNQAD is analyzed using lexical similarity techniques between questions and answers. (c) 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
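The cleaning step and the question-answer lexical similarity mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the similarity measure, so Jaccard overlap is used as one common choice; the stop-word list is a hypothetical English placeholder rather than the dataset's Kurdish list; and stemming is omitted.

```python
import re

# Hypothetical placeholder stop words (English, for readability only);
# the stop-word list actually used for KNQAD is not given in the abstract.
STOP_WORDS = {"the", "is", "of", "a", "in"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop punctuation and special characters,
    and remove stop words. Stemming is intentionally left out."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]


def jaccard_similarity(question, answer):
    """Jaccard overlap between the token sets of a question and an answer.
    This is an assumed measure, not necessarily the one used in the article."""
    q, a = set(tokenize(question)), set(tokenize(answer))
    if not q and not a:
        return 0.0  # avoid division by zero when both sides are empty
    return len(q & a) / len(q | a)


if __name__ == "__main__":
    print(jaccard_similarity("Who won the match?", "The home team won the match."))
```

A low score suggests that the answer shares few words with the question, which is one rough indicator that simple word matching would not be enough to answer it.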
Pages: 12