Multi-label text classification on unbalanced Twitter with monolingual model and hyperparameter optimization for hate speech and abusive language detection

Cited by: 0
Authors
Alzahrani, Ahmad A. [1]
Bramantoro, Arif [2]
Permana, Asep [3]
Affiliations
[1] King Abdulaziz Univ, Fac Comp & Informat Technol, Jeddah, Saudi Arabia
[2] Univ Teknol Brunei, Sch Comp & Informat, Bandar Seri Begawan, Brunei
[3] Univ Budi Luhur, Fac Informat Technol, Jakarta, Indonesia
Source
INTERNATIONAL JOURNAL OF ADVANCED AND APPLIED SCIENCES | 2024 / Vol. 11 / No. 5
Keywords
Hate speech; Abusive language; Imbalanced dataset; Multi-label text classification; Hyperparameter optimization
DOI
10.21833/ijaas.2024.05.019
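Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Subject classification codes
07; 0710; 09
Abstract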
The rise of hate speech and abusive language on social media makes interactions among users uncomfortable. Many publicly available datasets for hate speech and abusive language are imbalanced, particularly those from Indonesian Twitter. To build a more effective classification model that also accounts for minority classes, we optimized the hyperparameters of a monolingual model, tested four data preprocessing scenarios, and improved the handling of slang words. We evaluated the model by its accuracy, which reached 81.38%. This result came from optimizing hyperparameters, preprocessing the data without stemming and removing stop words, and enhancing the slang word data. The optimal hyperparameters were a learning rate of 4e-5, a batch size of 16, and a dropout rate of 0.1. However, too much dropout can reduce the model's performance and its ability to predict less common categories, such as physical- and gender-related hate speech.
Pages: 177-185
Number of pages: 9