ISE-Hate: A benchmark corpus for inter-faith, sectarian, and ethnic hatred detection on social media in Urdu

Cited by: 17
Authors
Akram, Muhammad Hammad [1]
Shahzad, Khurram [2]
Bashir, Maryam [1]
Affiliations
[1] Natl Univ Comp & Emerging Sci, FAST Sch Comp, Lahore, Pakistan
[2] Univ Punjab, Dept Data Sci, Lahore, Pakistan
Keywords
Hateful content detection; Urdu; Corpus generation; Sectarian; BERT; Ethnic; Hatred; Language; Twitter
DOI
10.1016/j.ipm.2023.103270
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Social media has become the most popular platform for free speech. This freedom has given the oppressed an opportunity to raise their voices against injustice, but it has also led to a disturbing trend of spreading hateful content of various kinds. Pakistan has been dealing with sectarian and ethnic violence for the last three decades, and disturbing content about religion, sect, and ethnicity is now growing on social media. This creates a need for an automated system that detects such content on social media in Urdu, which is the national language of Pakistan. The biggest hurdle for Urdu language processing is the scarcity of language resources, annotated datasets, and pretrained language models. In this study, we address the problem of detecting Interfaith, Sectarian, and Ethnic hatred on social media in the Urdu language using machine learning and deep learning techniques. In particular, we have: (1) developed and presented guidelines for annotating Urdu text with appropriate labels for two levels of classification, (2) developed a large dataset of 21,759 tweets using these guidelines and made it publicly available, and (3) conducted experiments to compare the performance of eight supervised machine learning and deep learning techniques for the automated identification of hateful content. In the first step, hateful content detection is performed as a binary classification task; in the second step, the detection of Interfaith, Sectarian, and Ethnic hatred is performed as a multiclass classification task. Overall, Bidirectional Encoder Representations from Transformers (BERT) proved to be the most effective technique for identifying hateful content in Urdu tweets.
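The two-step setup described in the abstract can be sketched as a small cascade. This is only an illustration under stated assumptions, not the authors' implementation: the function `two_level_classify`, the label strings, the toy classifiers, and the choice to run the second step only on text flagged in the first step are all hypothetical. In practice each step would be a trained model such as BERT.

```python
from typing import Callable, Optional, Tuple

# Second-level categories named in the abstract; the lowercase strings are assumed.
HATE_TYPES = ("interfaith", "sectarian", "ethnic")


def two_level_classify(
    text: str,
    is_hateful: Callable[[str], bool],
    hate_type: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Run the binary step first, then the multiclass step on hateful text only."""
    if not is_hateful(text):
        # "not_hateful" is an assumed name for the negative class.
        return ("not_hateful", None)
    label = hate_type(text)
    if label not in HATE_TYPES:
        raise ValueError(f"unexpected second-level label: {label!r}")
    return ("hateful", label)


# Toy stand-ins for trained models (hypothetical): they look for
# placeholder markers instead of learning from annotated tweets.
def toy_is_hateful(text: str) -> bool:
    return "<hate>" in text


def toy_hate_type(text: str) -> str:
    for label in HATE_TYPES:
        if f"<{label}>" in text:
            return label
    return HATE_TYPES[0]  # arbitrary fallback for the toy model


if __name__ == "__main__":
    print(two_level_classify("an ordinary tweet", toy_is_hateful, toy_hate_type))
    print(two_level_classify("<hate> <ethnic> tweet", toy_is_hateful, toy_hate_type))
```

Passing the two classifiers in as callables keeps the cascade independent of the model family, so any of the compared techniques could be swapped into either step.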
Pages: 23