ArQuAD: An Expert-Annotated Arabic Machine Reading Comprehension Dataset

Authors
Obeidat, Rasha [2]
Al-Harbi, Marwa [2]
Al-Ayyoub, Mahmoud [1, 2]
Alawneh, Luay [2]
Affiliations
[1] Ajman Univ, Ajman, U Arab Emirates
[2] Jordan Univ Sci & Technol, Irbid, Jordan
Keywords
Arabic Machine Reading Comprehension dataset; MRC; Question answering; Deep learning; Transformers; Data collection; AraBERT; QUESTION; BENCHMARK
DOI
10.1007/s12559-024-10248-6
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification code
081104; 0812; 0835; 1405
Abstract
Machine Reading Comprehension (MRC) is a task in which machines mirror key cognitive processes: reading a text passage, comprehending it, and answering questions about it. English MRC has progressed significantly in recent years: recent systems have not only surpassed human-level performance but also advanced in emulating complex human cognitive processes. However, the development of Arabic MRC has not kept pace, owing to language challenges and the lack of large-scale, high-quality datasets. Existing datasets are either small, low-quality, or released as part of large multilingual corpora. We present the Arabic Question Answering Dataset (ArQuAD), a large MRC dataset for the Arabic language. The dataset comprises 16,020 questions posed by language experts on passages extracted from Arabic Wikipedia articles, where the answer to each question is a text segment from the corresponding reading passage. Besides providing various dataset analyses, we fine-tuned several pre-trained language models to obtain benchmark results. Among the compared methods, AraBERTv0.2-large achieved the best performance, with an exact match of 68.95% and an F1-score of 87.15%. However, the substantially higher performance observed in human evaluations (exact match of 86% and F1-score of 95.5%) suggests considerable room for improvement in future research. We release the dataset publicly at https://github.com/RashaMObeidat/ArQuAD to encourage further development of language-aware MRC models for the Arabic language.
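The abstract reports exact match and F1-score for extractive answers, that is, answers that are text segments of the passage. The following is a minimal sketch of how such metrics are commonly computed for span-based question answering, not the authors' evaluation script. It assumes plain whitespace tokenization and no Arabic-specific normalization (diacritic or punctuation stripping); the abstract does not say which normalization was used. The function names `exact_match` and `token_f1` and the example strings are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the two answers have identical whitespace tokens, else 0.0."""
    return float(prediction.split() == reference.split())


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer span and a reference span."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # If either answer is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction includes one extra leading word.
reference = "مدينة إربد"
prediction = "في مدينة إربد"
em = exact_match(prediction, reference)
f1 = token_f1(prediction, reference)
```

In this example the prediction shares both reference tokens but adds one more, so exact match is 0 while the token-level F1 is 0.8. Partial credit of this kind is one reason a reported F1-score can sit well above the exact-match score.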
Pages: 984-1003
Number of pages: 20