Comparing Bag of Words and TF-IDF with different models for hate speech detection from live tweets

Cited by: 24
Authors
Akuma S. [1]
Lubem T. [1]
Adom I.T. [1]
Affiliations
[1] Department of Mathematics, Computer Science and Statistics, Benue State University, Makurdi
Keywords
Bag of Words; Hate speech; Machine learning algorithm; Sentiment analysis; Social media; TF-IDF; Twitter
DOI
10.1007/s41870-022-01096-4
Abstract
Social media platforms such as Twitter have revolutionized online communication and interaction, but they often carry content that shows disdain toward their growing user base. Such distressing content creates instability that can lead to mental breakdown and to the loss of lives and property, among other consequences of misuse. Although the problem posed by social media content is obvious, detecting hateful content remains a challenge. Several algorithms and techniques have been used to detect hateful content on social media, but there is room for improvement. The goal of this paper is to detect hate speech in live tweets on Twitter using a combination of mechanisms. Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW) features are compared across four machine learning models (Logistic Regression, Naïve Bayes, Decision Tree, and K-Nearest Neighbour (KNN)), and the results are used to select the best-performing model. This model is integrated into a web system built on the Twitter Application Programming Interface (API) and used to classify live tweets as hateful or not. The comparative study showed that Decision Tree outperformed the other three models, with an accuracy of 92.43% using TF-IDF, which gave better results than BoW. © 2022, The Author(s), under exclusive licence to Bharati Vidyapeeth's Institute of Computer Applications and Management.
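The difference between the two feature representations compared in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the toy tweets, the whitespace tokenizer, the function names, and the unsmoothed weighting tf × log(N/df) are all assumptions chosen for illustration, and the paper may use a different TF-IDF variant.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real tweets need more cleaning.
    return text.lower().split()


def bag_of_words(docs):
    """Return (vocabulary, count vectors): one raw term-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Return (vocabulary, TF-IDF vectors) using tf * log(N / df), without smoothing."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    vectors = []
    for row in counts:
        total = sum(row)
        vectors.append([(c / total) * w if total else 0.0 for c, w in zip(row, idf)])
    return vocab, vectors


# Hypothetical toy corpus, not data from the paper.
toy_tweets = ["you are great", "you are awful", "great match today"]
vocab, bow = bag_of_words(toy_tweets)
_, tfidf = tf_idf(toy_tweets)
```

In the second toy tweet, BoW gives "are" and "awful" the same count of 1, while TF-IDF weights "awful" higher because it occurs in only one document. Either set of vectors could then be passed to a classifier such as a decision tree.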
Pages: 3629-3635
Number of pages: 6