Deep learning for religious and continent-based toxic content detection and classification

Cited by: 14
Authors
Abbasi, Ahmed [1]
Javed, Abdul Rehman [2, 3]
Iqbal, Farkhund [4]
Kryvinska, Natalia [5]
Jalil, Zunera [1]
Affiliations
[1] Air Univ, Dept Creat Technol, PAF Complex,E-9, Islamabad, Pakistan
[2] Air Univ, Dept Cyber Secur, PAF Complex,E-9, Islamabad, Pakistan
[3] Lebanese Amer Univ, Dept Elect & Comp Engn, Byblos, Lebanon
[4] Zayed Univ, Coll Technol Innovat, Abu Dhabi, U Arab Emirates
[5] Comenius Univ, Fac Management, Informat Syst Dept, Odbojarov 10, Bratislava 82005 25, Slovakia
DOI
10.1038/s41598-022-22523-3
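Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09
Abstract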
Over time, numerous online communication platforms have emerged that allow people to express themselves, which has increased the dissemination of toxic language, such as racism, sexual harassment, and other negative behaviors that are not accepted in polite society. As a result, toxic language identification in online communication has emerged as a critical application of natural language processing. Many academic and industrial researchers have recently studied toxic language identification using machine learning algorithms. However, several machine learning models assigned unrealistically high toxicity ratings to nontoxic comments that contain particular identity descriptors, such as Muslim, Jewish, White, and Black. This research analyzes and compares modern deep learning algorithms for multilabel toxic comment classification. We explore two scenarios: the first is a multilabel classification of religious toxic comments, and the second is a multilabel classification of toxic race or ethnicity comments, with various word embeddings (GloVe, Word2vec, and FastText) and without them, using an ordinary embedding layer. Experiments show that the CNN model produced the best results for classifying multilabel toxic comments in both scenarios. We compared the performance of these deep learning models using multilabel evaluation metrics.
Pages: 12