Toxic language detection: A systematic review of Arabic datasets

Cited by: 3

Authors:

Bensalem, Imene [1, 2]

Rosso, Paolo [3]

Zitouni, Hanane [4]

Affiliations:

[1] ESCF Constantine, Constantine, Algeria

[2] Constantine 2 Univ, MISC Lab, Constantine, Algeria

[3] Univ Politecn Valencia, Valencia, Spain

[4] Constantine 2 Univ, Constantine, Algeria

Source:

EXPERT SYSTEMS | 2024 / Vol. 41 / Issue 08

Keywords:

annotation; Arabic datasets; dataset accessibility; dataset reusability; hate speech; offensive language; toxic language

DOI:

10.1111/exsy.13551

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The detection of toxic language in the Arabic language has emerged as an active area of research in recent years, and reviewing the existing datasets employed for training the developed solutions has become a pressing need. This paper offers a comprehensive survey of Arabic datasets focused on online toxic language. We systematically gathered a total of 54 available datasets and their corresponding papers and conducted a thorough analysis, considering 18 criteria across four primary dimensions: availability details, content, annotation process, and reusability. This analysis enabled us to identify existing gaps and make recommendations for future research works. For the convenience of the research community, the list of the analysed datasets is maintained in a GitHub repository.

Pages: 30
