Toxic language detection: A systematic review of Arabic datasets

Cited by: 3
Authors
Bensalem, Imene [1, 2]
Rosso, Paolo [3]
Zitouni, Hanane [4]
Affiliations
[1] ESCF Constantine, Constantine, Algeria
[2] Constantine 2 Univ, MISC Lab, Constantine, Algeria
[3] Univ Politecn Valencia, Valencia, Spain
[4] Constantine 2 Univ, Constantine, Algeria
Keywords
annotation; Arabic datasets; dataset accessibility; dataset reusability; hate speech; offensive language; toxic language
DOI
10.1111/exsy.13551
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The detection of toxic language in the Arabic language has emerged as an active area of research in recent years, and reviewing the existing datasets employed for training the developed solutions has become a pressing need. This paper offers a comprehensive survey of Arabic datasets focused on online toxic language. We systematically gathered a total of 54 available datasets and their corresponding papers and conducted a thorough analysis, considering 18 criteria across four primary dimensions: availability details, content, annotation process, and reusability. This analysis enabled us to identify existing gaps and make recommendations for future research works. For the convenience of the research community, the list of the analysed datasets is maintained in a GitHub repository.
Pages: 30
Related papers
50 items in total
  • [21] Unlocking the Potential: A Comprehensive Systematic Review of ChatGPT in Natural Language Processing Tasks
    Alomari, Ebtesam Ahmad
    CMES-COMPUTER MODELING IN ENGINEERING & SCIENCES, 2024, 141 (01) : 43 - 85
  • [22] Arabic Offensive Language Classification: Leveraging Transformer, LSTM, and SVM
    Rasheed, Areeg Fahad
    Zarkoosh, M.
    Abbas, Safa F.
    Al-Azzawi, Sana Sabah
    2023 IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLIED NETWORK TECHNOLOGIES, ICMLANT, 2023, : 115 - 120
  • [23] Sentiment analysis methods for politics and hate speech contents in Spanish language: a systematic review
    del Valle Martin, Ernesto
    de la Fuente Valentin, Luis
    IEEE LATIN AMERICA TRANSACTIONS, 2023, 21 (03) : 408 - 418
  • [24] An Automatic Approach for the Identification of Offensive Language in Perso-Arabic Urdu Language: Dataset Creation and Evaluation
    Din, Salah Ud
    Khusro, Shah
    Khan, Farman Ali
    Ahmad, Munir
    Ali, Oualid
    Ghazal, Taher M.
    IEEE ACCESS, 2025, 13 : 19755 - 19769
  • [25] USAD: An Intelligent System for Slang and Abusive Text Detection in PERSO-Arabic-Scripted Urdu
    Ul Haq, Nauman
    Ullah, Mohib
    Khan, Rafiullah
    Ahmad, Arshad
    Almogren, Ahmad
    Hayat, Bashir
    Shafi, Bushra
    COMPLEXITY, 2020, 2020
  • [26] A survey on multi-lingual offensive language detection
    Mnassri, Khouloud
    Farahbakhsh, Reza
    Chalehchaleh, Razieh
    Rajapaksha, Praboda
    Jafari, Amir Reza
    Li, Guanlin
    Crespi, Noel
    PEERJ COMPUTER SCIENCE, 2024, 10
  • [27] Hate and offensive speech detection on Arabic social media
    Alsafari S.
    Sadaoui S.
    Mouhoub M.
    Online Social Networks and Media, 2020, 19
  • [28] Detection of fake news and hate speech for Ethiopian languages: a systematic review of the approaches
    Wubetu Barud Demilie
    Ayodeji Olalekan Salau
    Journal of Big Data, 9
  • [29] Detection of fake news and hate speech for Ethiopian languages: a systematic review of the approaches
    Demilie, Wubetu Barud
    Salau, Ayodeji Olalekan
    JOURNAL OF BIG DATA, 2022, 9 (01)
  • [30] A Review of Deep Learning Techniques for Multimodal Fake News and Harmful Languages Detection
    Festus Ayetiran, Eniafe
    Ozgobek, Ozlem
    IEEE ACCESS, 2024, 12 : 76133 - 76153