Hate speech detection with ADHAR: a multi-dialectal hate speech corpus in Arabic

Cited by: 2
Authors
Charfi, Anis [1]
Besghaier, Mabrouka [1]
Akasheh, Raghda [1]
Atalla, Andria [1]
Zaghouani, Wajdi [2]
Affiliations
[1] Carnegie Mellon Univ, Informat Syst Dept, Doha, Qatar
[2] Hamad Bin Khalifa Univ, Coll Humanities & Social Sci, Doha, Qatar
Source
FRONTIERS IN ARTIFICIAL INTELLIGENCE | 2024 / Volume 7
Keywords
natural language processing; hate speech; Arabic language; dialectal Arabic; dataset annotation; Arabic corpora
DOI
10.3389/frai.2024.1391472
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Hate speech detection in Arabic is a complex challenge because of the dialectal diversity across the Arab world. Most existing Arabic hate speech datasets cover only one dialect or one hate speech category, and they lack balance across dialects, topics, and hate/non-hate classes. In this paper, we address this gap by presenting ADHAR, a comprehensive multi-dialect, multi-category hate speech corpus for Arabic. ADHAR contains 70,369 words and spans five language variants: Modern Standard Arabic (MSA), Egyptian, Levantine, Gulf, and Maghrebi. It covers four key hate speech categories: nationality, religion, ethnicity, and race. A major contribution is that ADHAR is carefully curated to maintain balance across dialects, categories, and hate/non-hate classes, which enables unbiased evaluation. We describe the systematic data collection methodology, followed by a rigorous annotation process involving multiple annotators per dialect. Extensive qualitative and quantitative analyses demonstrate the quality and usefulness of ADHAR. Our experiments with various classical and deep learning models show that the dataset enables the development of robust hate speech classifiers for Arabic, achieving accuracy and F1-scores of up to 90% for hate speech detection and up to 92% for category detection. With AraBERT, we achieved an accuracy and F1-score of 94% for hate speech detection and 95% for category detection.
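The balance property described in the abstract can be illustrated with a minimal sketch. It assumes each example carries a dialect, a category, and a hate/non-hate label. The toy records and the `balance_ratio` helper are hypothetical and are not taken from ADHAR or from the authors' code; the ratio of the smallest to the largest group count is just one simple way to express balance.

```python
from collections import Counter

# Hypothetical toy records: (dialect, category, label). Not ADHAR data.
records = [
    ("MSA", "religion", "hate"),
    ("MSA", "religion", "non-hate"),
    ("Egyptian", "race", "hate"),
    ("Egyptian", "race", "non-hate"),
    ("Gulf", "nationality", "hate"),
    ("Gulf", "nationality", "non-hate"),
]


def balance_ratio(items):
    """Return the smallest group count divided by the largest.

    A value of 1.0 means every group has the same number of examples.
    """
    counts = Counter(items)
    return min(counts.values()) / max(counts.values())


# Check balance along two of the dimensions named in the abstract.
dialect_balance = balance_ratio(r[0] for r in records)
label_balance = balance_ratio(r[2] for r in records)
print(dialect_balance, label_balance)
```

A ratio well below 1.0 on any dimension would suggest that a classifier's scores could be dominated by the over-represented group.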
Pages: 12