Toxic language detection: A systematic review of Arabic datasets

Cited by: 3
Authors
Bensalem, Imene [1, 2]
Rosso, Paolo [3]
Zitouni, Hanane [4]
Affiliations
[1] ESCF Constantine, Constantine, Algeria
[2] Constantine 2 Univ, MISC Lab, Constantine, Algeria
[3] Univ Politecn Valencia, Valencia, Spain
[4] Constantine 2 Univ, Constantine, Algeria
Keywords
annotation; Arabic datasets; dataset accessibility; dataset reusability; hate speech; offensive language; toxic language
DOI
10.1111/exsy.13551
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The detection of toxic language in the Arabic language has emerged as an active area of research in recent years, and reviewing the existing datasets employed for training the developed solutions has become a pressing need. This paper offers a comprehensive survey of Arabic datasets focused on online toxic language. We systematically gathered a total of 54 available datasets and their corresponding papers and conducted a thorough analysis, considering 18 criteria across four primary dimensions: availability details, content, annotation process, and reusability. This analysis enabled us to identify existing gaps and make recommendations for future research works. For the convenience of the research community, the list of the analysed datasets is maintained in a GitHub repository.
Pages: 30