ArQuAD: An Expert-Annotated Arabic Machine Reading Comprehension Dataset

Cited by: 0

Authors:

Obeidat, Rasha [2]

Al-Harbi, Marwa [2]

Al-Ayyoub, Mahmoud [1,2]

Alawneh, Luay [2]

Affiliations:

[1] Ajman University, Ajman, United Arab Emirates

[2] Jordan University of Science and Technology, Irbid, Jordan

Source:

Cognitive Computation | 2024 / Volume 16 / Issue 03

Keywords:

Arabic Machine Reading Comprehension dataset; MRC; Question answering; Deep learning; Transformers; Data collection; AraBERT; QUESTION; BENCHMARK

DOI:

10.1007/s12559-024-10248-6

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Machine Reading Comprehension (MRC) is a task that enables machines to mirror key cognitive processes involving reading, comprehending a text passage, and answering questions about it. There has been significant progress in this task for English in recent years, where recent systems have not only surpassed human-level performance but also demonstrated advances in emulating complex human cognitive processes. However, the development of Arabic MRC has not kept pace due to language challenges and the lack of large-scale, high-quality datasets. Existing datasets are either small, low-quality, or released as part of large multilingual corpora. We present the Arabic Question Answering Dataset (ArQuAD), a large MRC dataset for the Arabic language. The dataset comprises 16,020 questions posed by language experts on passages extracted from Arabic Wikipedia articles, where the answer to each question is a text segment from the corresponding reading passage. Besides providing various dataset analyses, we fine-tuned several pre-trained language models to obtain benchmark results. Among the compared methods, AraBERTv0.2-large achieved the best performance, with an exact match of 68.95% and an F1-score of 87.15%. However, the substantially higher performance observed in human evaluations (exact match of 86% and F1-score of 95.5%) suggests considerable room for improvement in future research. We release the dataset publicly at https://github.com/RashaMObeidat/ArQuAD to encourage further development of language-aware MRC models for the Arabic language.
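
The abstract reports exact match and F1-score but does not say how they are computed. The sketch below assumes the definitions commonly used for span-extraction MRC: exact match compares the normalized answer strings, and F1 is the harmonic mean of token-level precision and recall. The function names and the whitespace-only normalization are illustrative assumptions, not the authors' evaluation script; an evaluation for Arabic would likely add further normalization (for example, of diacritics) that is not shown here.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer span."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical prediction/reference pair, not taken from the dataset.
    print(exact_match("مدينة إربد", "إربد"))
    print(f1_score("مدينة إربد", "إربد"))
```

Under these definitions a prediction that contains the reference plus extra tokens earns partial F1 credit but no exact-match credit, which is one reason an F1-score can sit well above the exact-match figure reported for the same system.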


Pages: 984-1003

Page count: 20

50 entries in total

[1] Expert-Annotated Dataset to Study Cyberbullying in Polish Language
Ptaszynski, Michal
Pieciukiewicz, Agata
Dybala, Pawel
Skrzek, Pawel
Soliwoda, Kamil
Fortuna, Marcin
Leliwa, Gniewosz
Wroczynski, Michal
DATA, 2024, 9 (01)
[2] VisImages: A Fine-Grained Expert-Annotated Visualization Dataset
Deng, Dazhen
Wu, Yihong
Shu, Xinhuan
Wu, Jiang
Fu, Siwei
Cui, Weiwei
Wu, Yingcai
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2023, 29 (07) : 3298 - 3311
[3] Biasly: An Expert-Annotated Dataset for Subtle Misogyny Detection and Mitigation
Sheppard, Brooklyn
Richter, Anna
Cohen, Allison
Smith, Elizabeth Allyn
Kneese, Tamara
Pelletier, Carolyne
Baldini, Ioana
Dong, Yue
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024 : 427 - 452
[4] ANNO-MI: A DATASET OF EXPERT-ANNOTATED COUNSELLING DIALOGUES
Wu, Zixiu
Balloccu, Simone
Kumar, Vivek
Helaoui, Rim
Reiter, Ehud
Recupero, Diego Reforgiato
Riboni, Daniele
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022 : 6177 - 6181
[5] MAUD: An Expert-Annotated Legal NLP Dataset for Merger Agreement Understanding
Wang, Steven H.
Scardigli, Antoine
Tang, Leonard
Chen, Wei
Levkin, Dimitry
Chen, Anya
Ball, Spencer
Woodside, Thomas
Zhang, Oliver
Hendrycks, Dan
2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023 : 16369 - 16382
[6] Creation, Analysis and Evaluation of AnnoMI, a Dataset of Expert-Annotated Counselling Dialogues
Wu, Zixiu
Balloccu, Simone
Kumar, Vivek
Helaoui, Rim
Recupero, Diego Reforgiato
Riboni, Daniele
FUTURE INTERNET, 2023, 15 (03)
[7] STANDER: An Expert-Annotated Dataset for News Stance Detection and Evidence Retrieval
Conforti, Costanza
Berndt, Jakob
Pilehvar, Mohammad Taher
Giannitsarou, Chryssi
Toxvaerd, Flavio
Collier, Nigel
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020 : 4086 - 4101
[8] Conceptual Questions in Developing Expert-Annotated Data
Ma, Megan
Waldon, Brandon
Nyarko, Julian
PROCEEDINGS OF THE 19TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND LAW, ICAIL 2023, 2023 : 427 - 431
[9] BIOMRC: A Dataset for Biomedical Machine Reading Comprehension
Stavropoulos, Petros
Pappas, Dimitris
Androutsopoulos, Ion
McDonald, Ryan
19TH SIGBIOMED WORKSHOP ON BIOMEDICAL LANGUAGE PROCESSING (BIONLP 2020), 2020 : 140 - 149
[10] Methods and Trends of Machine Reading Comprehension in the Arabic Language
Alkhatnai, Mubarak
Amjad, Hamza Imam
Amjad, Maaz
Gelbukh, Alexander
COMPUTACION Y SISTEMAS, 2020, 24 (04) : 1607 - 1615
