T-HSAB: A Tunisian Hate Speech and Abusive Dataset

Cited by: 35
Authors
Haddad, Hatem [1, 3]
Mulki, Hala [2, 3]
Oueslati, Asma [1]
Affiliations
[1] Manouba Univ, Natl Sch Comp Sci, RIADI Lab, Manouba, Tunisia
[2] Konya Tech Univ, Dept Comp Engn, Konya, Turkey
[3] iCompass Consulting, Tunis, Tunisia
Source
ARABIC LANGUAGE PROCESSING: FROM THEORY TO PRACTICE, ICALP 2019 | 2019 / Vol. 1108
Keywords
Tunisian dialect; Abusive speech; Hate speech
DOI
10.1007/978-3-030-32959-4_18
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Since the "Jasmine Revolution" in 2011, Tunisia has entered a new era of ultimate freedom of expression with full access to social media. This has been accompanied by an unrestricted spread of toxic content such as Abusive and Hate speech. Considering the psychological harm, let alone the potential hate crimes, that such toxic content might cause, automatic Abusive and Hate speech detection systems have become a necessity. This creates a need for Tunisian benchmark datasets with which to evaluate Abusive and Hate speech detection models. Because Tunisian is an underrepresented dialect, no Abusive or Hate speech dataset had previously been provided for it. In this paper, we introduce the first publicly available Tunisian Hate and Abusive speech (T-HSAB) dataset, intended as a benchmark dataset for the automatic detection of online Tunisian toxic content. We provide a detailed review of the data collection steps and of how we designed the annotation guidelines so that the dataset annotation is reliable. This reliability was confirmed by a comprehensive evaluation of the annotations: the agreement metrics Cohen's kappa (κ) and Krippendorff's alpha (α) indicated that the annotations are consistent.
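As an illustration of the first agreement metric named in the abstract, the following is a minimal sketch of Cohen's kappa for two annotators, computed as (observed agreement - chance agreement) / (1 - chance agreement). It is not taken from the paper: the function name `cohen_kappa`, the label names, and the toy annotations are assumptions made up for this example, and the record does not say how the authors computed their scores.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Hypothetical helper for illustration; not the authors' code.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")

    n = len(labels_a)

    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined (0/0), so report full agreement by convention.
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Toy annotations (invented for this sketch, not from T-HSAB).
annotator_1 = ["hate", "abusive", "normal", "normal", "abusive", "normal"]
annotator_2 = ["hate", "normal", "normal", "normal", "abusive", "normal"]

print(round(cohen_kappa(annotator_1, annotator_2), 4))
```

In this toy case the annotators agree on 5 of 6 items (observed agreement 0.8333) and the chance agreement is 15/36 (0.4167), so kappa is 5/7, about 0.7143. Krippendorff's alpha generalizes the idea to more than two annotators and to missing labels, and is not covered by this sketch.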
Pages: 251-263
Number of pages: 13