Toxic language detection: A systematic review of Arabic datasets

Cited by: 3
Authors
Bensalem, Imene [1 ,2 ]
Rosso, Paolo [3 ]
Zitouni, Hanane [4 ]
Affiliations
[1] ESCF Constantine, Constantine, Algeria
[2] Constantine 2 Univ, MISC Lab, Constantine, Algeria
[3] Univ Politecn Valencia, Valencia, Spain
[4] Constantine 2 Univ, Constantine, Algeria
Keywords
annotation; Arabic datasets; dataset accessibility; dataset reusability; hate speech; offensive language; toxic language
DOI
10.1111/exsy.13551
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The detection of toxic language in the Arabic language has emerged as an active area of research in recent years, and reviewing the existing datasets employed for training the developed solutions has become a pressing need. This paper offers a comprehensive survey of Arabic datasets focused on online toxic language. We systematically gathered a total of 54 available datasets and their corresponding papers and conducted a thorough analysis, considering 18 criteria across four primary dimensions: availability details, content, annotation process, and reusability. This analysis enabled us to identify existing gaps and make recommendations for future research works. For the convenience of the research community, the list of the analysed datasets is maintained in a GitHub repository.
Pages: 30
Related papers
50 in total
  • [1] A Survey of Offensive Language Detection for the Arabic Language
    Husain, Fatemah
    Uzuner, Ozlem
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2021, 20 (01)
  • [2] Fake-news detection: a survey of evaluation Arabic datasets
    Yousef, Mohammed Abbas
    Elkorany, Abeer
    Bayomi, Hanaa
    SOCIAL NETWORK ANALYSIS AND MINING, 2024, 14 (01)
  • [3] Offensive Language Detection from Arabic Texts
    Awajan, Arafat A.
    INTELLIGENT COMPUTING, VOL 3, 2024, 2024, 1018 : 77 - 91
  • [4] Detection of Hateful Social Media Content for Arabic Language
    Al-Ibrahim, Rogayah M.
    Ali, Mostafa Z.
    Najadat, Hassan M.
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (09)
  • [5] A systematic review of hate speech automatic detection using natural language processing
    Jahan, Md Saroar
    Oussalah, Mourad
    NEUROCOMPUTING, 2023, 546
  • [6] A Multi-Platform Arabic News Comment Dataset for Offensive Language Detection
    Chowdhury, Shammur A.
    Mubarak, Hamdy
    Abdelali, Ahmed
    Jung, Soon-gyo
    Jansen, Bernard J.
    Salminen, Joni
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 6203 - 6212
  • [7] Detection of Arabic offensive language in social media using machine learning models
    Mousa, Aya
    Shahin, Ismail
    Nassif, Ali Bou
    Elnagar, Ashraf
    INTELLIGENT SYSTEMS WITH APPLICATIONS, 2024, 22
  • [8] Hate Speech Detection using Word Embedding and Deep Learning in the Arabic Language Context
    Faris, Hossam
    Aljarah, Ibrahim
    Habib, Maria
    Castillo, Pedro A.
    ICPRAM: PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION APPLICATIONS AND METHODS, 2020, : 453 - 460
  • [9] Augmented Behavioral Annotation Tools, with Application to Multimodal Datasets and Models: A Systematic Review
    Watson, Eleanor
    Viana, Thiago
    Zhang, Shujun
    AI, 2023, 4 (01) : 128 - 171
  • [10] Arabic Text Mining: A Systematic Review of the Published Literature 2002-2014
    Al-Mahmoud, Hind
    Al-Razgan, Muna
    2015 INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (ICCC), 2015, : 65 - 71