Automatic Detection of Offensive Language for Urdu and Roman Urdu

Cited by: 61
Authors
Akhter, Muhammad Pervez [1]
Zheng Jiangbin [1]
Naqvi, Irfan Raza [1]
Abdelmajeed, Mohammed [2]
Sadiq, Muhammad Tariq [3]
Affiliations
[1] Northwestern Polytechnical University, School of Software and Microelectronics, Xi'an 710072, People's Republic of China
[2] Northwestern Polytechnical University, School of Computer Science and Technology, Xi'an 710072, People's Republic of China
[3] Northwestern Polytechnical University, School of Automation, Xi'an 710072, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Machine learning; YouTube; Feature extraction; Videos; Writing; Twitter; Social media; Offensive language detection; Natural language processing; Text processing; Online communication; Hate speech; Classification
DOI
10.1109/ACCESS.2020.2994950
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In recent years, unethical behavior in the cyber-environment has come to light. The presence of offensive language on social media platforms, and the automatic detection of such language, is becoming a major challenge in modern society. The complexity of natural language constructs makes this task even more challenging. Until now, most research has focused on resource-rich languages such as English. Urdu is written on social media in two scripts, Urdu and Roman Urdu: the Roman script uses English-language characters, while the Urdu script uses Urdu-language characters. Urdu and Hindi are similar languages that differ only in their writing scripts, and the Roman scripts of the two languages are similar. This study addresses the detection of offensive language in user comments written in Urdu, a resource-poor language. We propose the first offensive-language dataset for Urdu, containing user-generated comments from social media. We use individual and combined n-gram techniques to extract features at the character level and the word level. We apply seventeen classifiers from seven machine learning techniques to detect offensive language in both Urdu and Roman Urdu text comments. Experiments show that regression-based models using character n-grams perform best on the Urdu language. Character-level tri-grams outperform the other word and character n-grams. LogitBoost and SimpleLogistic outperform the other models, achieving F-measure values of 99.2 and 95.9 on the Roman Urdu and Urdu datasets, respectively. Our dataset is publicly available on GitHub for future research.
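The character-level and word-level n-gram features mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the feature-key layout, and the sample comment are all assumptions made for illustration, and the abstract does not say how the text was preprocessed.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams; tokens are split on whitespace (an assumption)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_features(text, char_sizes=(3,), word_sizes=(1,)):
    """Count n-grams of several sizes in one feature dictionary.

    Keys are (level, n, gram) tuples so that a character gram and a
    word gram with the same spelling stay distinct.
    """
    features = Counter()
    for n in char_sizes:
        features.update(("char", n, g) for g in char_ngrams(text, n))
    for n in word_sizes:
        features.update(("word", n, g) for g in word_ngrams(text, n))
    return features


# Hypothetical Roman Urdu comment, used only to exercise the functions.
comment = "yeh acha hai"
tri_grams = char_ngrams(comment, 3)
features = combined_features(comment, char_sizes=(2, 3), word_sizes=(1, 2))
```

Python strings are sequences of Unicode code points, so the same functions accept Urdu-script text, although a combining mark may then end up in a different n-gram from its base letter. The resulting counts could serve as input to any classifier; which feature weighting the paper used is not stated in the abstract.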
Pages: 91213-91226
Page count: 14