Multimodal hate speech detection: a novel deep learning framework for multilingual text and images

Cited by: 0
Authors
Saddozai, Furqan Khan [1 ]
Badri, Sahar K. [2 ]
Alghazzawi, Daniyal [2 ]
Khattak, Asad [3 ]
Asghar, Muhammad Zubair [1 ]
Affiliations
[1] Gomal Research Institute of Computing, Faculty of Computing, Gomal University, D.I. Khan, KP
[2] Information Systems Department, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah
[3] College of Technological Innovation, Zayed University, Abu Dhabi Campus, Abu Dhabi
Keywords
BiLSTM; Deep learning; EfficientNetB1; Hate speech; Image; Multilingual; Multimodal; Urdu-English
DOI
10.7717/peerj-cs.2801
Abstract
The rapid proliferation of social media platforms has facilitated the expression of opinions but also enabled the spread of hate speech. Detecting multimodal hate speech in low-resource multilingual contexts poses significant challenges. This study presents a deep learning framework that integrates bidirectional long short-term memory (BiLSTM) and EfficientNetB1 to classify hate speech in Urdu-English tweets, leveraging both text and image modalities. We introduce multimodal multilingual hate speech (MMHS11K), a manually annotated dataset comprising 11,000 multimodal tweets. Using an early fusion strategy, text and image features were combined for classification. Experimental results demonstrate that the BiLSTM+EfficientNetB1 model outperforms unimodal and baseline multimodal approaches, achieving an F1-score of 81.2% for Urdu tweets and 75.5% for English tweets. This research addresses critical gaps in multilingual and multimodal hate speech detection, offering a foundation for future advancements. © 2025 Saddozai et al.
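To make the early-fusion architecture described in the abstract concrete, below is a minimal sketch in Keras: a BiLSTM encodes the tweet text, an EfficientNetB1 backbone encodes the image, and the two feature vectors are concatenated before a shared classifier head. The vocabulary size, sequence length, layer widths, and dropout rate are illustrative assumptions, not the authors' reported configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical hyperparameters; the record does not specify the paper's values.
VOCAB_SIZE = 20000
MAX_LEN = 64
EMBED_DIM = 128

# Text branch: BiLSTM over token embeddings (Urdu or English tweet text).
text_in = layers.Input(shape=(MAX_LEN,), name="tokens")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(text_in)
x = layers.Bidirectional(layers.LSTM(128))(x)  # 256-dim text feature

# Image branch: EfficientNetB1 backbone with global average pooling.
# 240x240 is EfficientNetB1's default input resolution.
img_in = layers.Input(shape=(240, 240, 3), name="image")
backbone = tf.keras.applications.EfficientNetB1(
    include_top=False, weights="imagenet", pooling="avg")
y = backbone(img_in)  # 1280-dim image feature

# Early fusion: concatenate modality features before classification.
fused = layers.Concatenate()([x, y])
fused = layers.Dense(256, activation="relu")(fused)
fused = layers.Dropout(0.5)(fused)
out = layers.Dense(1, activation="sigmoid", name="hate")(fused)

model = Model(inputs=[text_in, img_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

Fusing the pooled features with a single concatenation, as above, is the usual way an early-fusion baseline is built; a late-fusion variant would instead train separate classifiers per modality and combine their predictions.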