Threatening Language Detection and Target Identification in Urdu Tweets

Cited by: 29
Authors
Amjad, Maaz [1 ]
Ashraf, Noman [1 ]
Zhila, Alisa
Sidorov, Grigori [1 ]
Zubiaga, Arkaitz [2 ]
Gelbukh, Alexander [1 ]
Affiliations
[1] Inst Politecn Nacl, Ctr Invest Comp CIC, Mexico City 07738, DF, Mexico
[2] Queen Mary Univ London, Sch Elect Engn & Comp Sci, London E1 4NS, England
Source
IEEE ACCESS | 2021 / Vol. 9 / Issue 09
Keywords
Threatening language detection; threat target identification; annotated dataset; Urdu language; offensive language
DOI
10.1109/ACCESS.2021.3112500
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
Automatic detection of threatening language is an important task; however, most existing studies have focused on English as the target language, with limited work on low-resource languages. In this paper, we introduce and release a new dataset for threatening language detection in Urdu tweets to further research in this language. The proposed dataset contains 3,564 tweets manually annotated by human experts as either threatening or non-threatening. The threatening tweets are further classified by target into one of two types: threatening to an individual person or threatening to a group. This research follows a two-step approach: (i) classify a given tweet as threatening or non-threatening and (ii) classify whether a threatening tweet is used to threaten an individual or a group. We compare three forms of text representation: two count-based representations, in which the text is represented as feature vectors of either character n-gram counts or word n-gram counts, and a third based on fastText pre-trained word embeddings for Urdu. We perform several experiments using machine learning and deep learning classifiers. Our study shows that an MLP classifier with the combination of word n-gram features outperformed other classifiers in detecting threatening tweets. Further, an SVM classifier using fastText pre-trained word embeddings obtained the best results for the target identification task.
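The count-based representations and the two-step approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`char_ngrams`, `word_ngrams`, `two_step_label`), the whitespace tokenization, and the label strings are all assumptions made for illustration, and the two classifiers are passed in as hypothetical callables rather than trained models.

```python
from collections import Counter
from typing import Callable


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text: str, n: int) -> Counter:
    """Count overlapping word n-grams after whitespace tokenization (an assumption)."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def two_step_label(
    tweet: str,
    is_threat: Callable[[str], bool],
    targets_group: Callable[[str], bool],
) -> str:
    """Step (i): threatening or not; step (ii): individual vs. group target.

    Both callables are hypothetical stand-ins for trained classifiers.
    """
    if not is_threat(tweet):
        return "non-threatening"
    return "threat-group" if targets_group(tweet) else "threat-individual"
```

In this sketch the second classifier is only consulted for tweets that the first one flags, which mirrors the cascade in steps (i) and (ii); the n-gram counters would be turned into fixed-length feature vectors before being fed to a real classifier.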
Pages: 128302-128313
Page count: 12