Cyberbullying Text Identification: A Deep Learning and Transformer-based Language Modeling Approach

Cited by: 0

Authors:

Saifullah K. [1]

Khan M.I. [1]

Jamal S. [2]

Sarker I.H. [3]

Affiliations:

[1] Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chittagong

[2] Dept. of Information Technology, Georgia Southern University, Statesboro, GA

[3] Centre for Securing Digital Futures, School of Science, Edith Cowan University, Perth, 6027, WA

Source:

EAI Endorsed Transactions on Industrial Networks and Intelligent Systems | 2024 / Vol. 11 / Issue 01

Keywords:

Cyberbullying; deep learning; fine tuning; harmful messages; large language modeling; natural language processing (NLP); OOV; transformers models

DOI:

10.4108/EETINIS.V11I1.4703

Abstract:

In the contemporary digital age, social media platforms such as Facebook, Twitter, and YouTube serve as vital channels for individuals to express ideas and connect with others. Despite fostering increased connectivity, these platforms have also given rise to negative behaviors, particularly cyberbullying. While high-resource languages such as English have been researched extensively, resources for low-resource languages such as Bengali, Arabic, and Tamil remain scarce, particularly for language modeling. This study addresses this gap by developing BullyFilterNeT, a cyberbullying text identification system tailored to social media texts, with Bengali as a test case. BullyFilterNeT overcomes the Out-of-Vocabulary (OOV) challenges associated with non-contextual embeddings and addresses the limitations of context-aware feature representations. To facilitate a comprehensive understanding, three non-contextual embedding models (GloVe, FastText, and Word2Vec) are developed for feature extraction in Bengali. These embeddings are used in three statistical classifiers (SVM, SGD, Libsvm) and four deep learning models (CNN, VDCNN, LSTM, GRU). To overcome the limitations of these earlier models, the study also employs six transformer-based language models: mBERT, bELECTRA, IndicBERT, XLM-RoBERTa, DistilBERT, and BanglaBERT. BanglaBERT-based BullyFilterNeT achieves the highest accuracy, 88.04% on our test set, underscoring its effectiveness for cyberbullying text identification in Bengali.

Copyright © 2024 K. Saifullah et al., licensed to EAI. This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.
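The abstract notes that non-contextual embeddings suffer from Out-of-Vocabulary (OOV) words. One common mitigation, the subword idea behind FastText, is to build a word's vector from the vectors of its character n-grams, so an unseen word still gets a representation. The following is a minimal sketch of that idea only, not the authors' implementation: the names `char_ngrams`, `bucket`, and `subword_vector`, the CRC32 hash, the sizes `DIM` and `BUCKETS`, and the random table standing in for trained n-gram vectors are all assumptions made for illustration.

```python
import random
import zlib

DIM = 8          # illustrative embedding size (assumption)
BUCKETS = 1000   # illustrative number of hash buckets (assumption)


def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def bucket(ngram):
    """Map an n-gram to a row index with a deterministic hash (CRC32 here)."""
    return zlib.crc32(ngram.encode("utf-8")) % BUCKETS


def subword_vector(word, table):
    """Average the n-gram rows of `table`; works for words never seen before."""
    grams = char_ngrams(word)
    if not grams:
        # Word too short to yield any n-gram: fall back to a zero vector.
        return [0.0] * DIM
    rows = [table[bucket(g)] for g in grams]
    return [sum(column) / len(rows) for column in zip(*rows)]


# Random stand-in for a trained n-gram embedding table (assumption).
random.seed(0)
ngram_table = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(BUCKETS)]

# Any string, including a word absent from a training vocabulary, gets a vector.
oov_vector = subword_vector("বাংলা", ngram_table)
print(len(oov_vector))
```

A plain lookup-table embedding has no entry for an unseen word, whereas this composition always returns a vector of length `DIM`. The sketch simplifies the real method: FastText also adds a vector for the whole word when it is in the vocabulary and uses a different hash function.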

Pages: 1 / 12

Page count: 11
