Hate speech recognition in multilingual text: Hinglish documents

Cited by: 6
Authors
Yadav A.K. [1]
Kumar M. [1]
Kumar A. [1]
Shivani [1]
Kusum [1]
Yadav D. [2]
Affiliations
[1] Department of Computer Science & Engineering, NIT Hamirpur (HP), Hamirpur
[2] School of Computer and Information Sciences, Indira Gandhi National Open University, Delhi
Keywords
BiLSTM; CNN; Deep learning; Hate speech; Machine learning; Word2Vec
DOI
10.1007/s41870-023-01211-z
Abstract
The Internet is a boon for mankind, but its misuse has been increasing drastically. Social networking platforms such as Facebook, Twitter and Instagram play a predominant role in letting users express their views. Sometimes users employ abusive or inflammatory language that may provoke readers. This paper applies and evaluates several machine learning and deep learning methods, along with various feature extraction and word-embedding techniques, for detecting hate speech in tweets and comments from social media platforms written in Hinglish (code-mixed English-Hindi), using a consolidated dataset of 20,600 instances. The experimental results reveal that deep learning models generally perform better than machine learning models. Among the deep learning models, the CNN-BiLSTM model with word2vec word embeddings provides the best results, yielding 0.876 accuracy, 0.830 precision, 0.840 recall and 0.835 F1-score. These results surpass recent state-of-the-art approaches. © 2023, The Author(s), under exclusive licence to Bharati Vidyapeeth's Institute of Computer Applications and Management.
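The F1-score quoted in the abstract can be checked against the reported precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the abstract does not state how the scores were averaged across classes, and the helper name `f1_score` is illustrative only, not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1)."""
    if precision + recall == 0:
        # Avoid division by zero when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the CNN-BiLSTM model with word2vec.
reported_precision = 0.830
reported_recall = 0.840
reported_f1 = 0.835

computed_f1 = f1_score(reported_precision, reported_recall)
print(f"computed F1 = {computed_f1:.3f}, reported F1 = {reported_f1:.3f}")
```

Under that assumption, the computed value rounds to the reported 0.835, so the three figures are mutually consistent.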
Pages: 1319-1331
Page count: 12