HATE SPEECH DETECTION IN LOW-RESOURCE BODO AND ASSAMESE TEXTS WITH ML-DL AND BERT MODELS

Cited by: 6
Authors
Ghosh, Koyel [1]
Senapati, Apurbalal [1]
Narzary, Mwnthai [1]
Brahma, Maharaj [2]
Affiliations
[1] Cent Inst Technol, Dept Comp Sci & Engn, Kokrajhar, Assam, India
[2] IIT Hyderabad, Dept Comp Sci & Engn, Hyderabad, India
Source
SCALABLE COMPUTING-PRACTICE AND EXPERIENCE | 2023 / Vol. 24 / Issue 04
Keywords
Hate Speech Detection; Assamese; Bodo; Natural Language Processing; NLP; Machine Learning; Deep Learning; Word2Vec; NB; SVM; LSTM; BiLSTM; CNN; BERT
DOI
10.12694/scpe.v24i4.2469
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Hate speech detection is a recent and highly active research topic in natural language processing (NLP). Unrestrained use of social media platforms encourages people to be over-opinionated, to the point where comments and posts turn toxic. A toxic outlook increases violence towards one's neighbour, state, country, and continent. Several countries have introduced laws to address this urgent problem, and media platforms have now started working to restrict hateful posts and comments. When framed as a supervised learning task, hate speech detection is generally a text classification problem. Handling text computationally is challenging because of its semantics and complex grammar. Resource-rich languages benefit from abundant data, whereas resource-scarce languages suffer significantly from a lack of datasets. This paper makes a multifaceted contribution encompassing resource generation, experimentation with Machine Learning (ML), Deep Learning (DL), and state-of-the-art transformer-based models, and a comprehensive evaluation of model performance, including thorough error analysis. In the realm of resource generation, it adds to the North-East Indian Hate Speech tagged dataset (NEIHS version 1), which encompasses two languages: Assamese and Bodo.
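The abstract frames hate speech detection as supervised text classification and lists NB (Naive Bayes) among its keywords. As a rough illustration of that framing only, the sketch below trains a tiny multinomial Naive Bayes classifier with add-one smoothing in plain Python. It is not the authors' pipeline: the English toy sentences, the labels `hate` and `not_hate`, and the function names `tokenize`, `train_nb`, and `predict_nb` are all hypothetical, and the NEIHS data is not used.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Whitespace tokenization; a real system would need language-specific handling.
    return text.lower().split()


def train_nb(samples):
    """Count class and per-class word frequencies from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented purely for illustration.
TOY_SAMPLES = [
    ("you people are disgusting and should leave", "hate"),
    ("i hate everyone from that town", "hate"),
    ("what a lovely festival in the town today", "not_hate"),
    ("thank you for sharing this news", "not_hate"),
]

model = train_nb(TOY_SAMPLES)
```

Calling `predict_nb(model, "i hate that town")` returns one of the two toy labels. With only four training sentences the classifier is purely demonstrative, which mirrors the abstract's point that low-resource languages are limited mainly by the lack of labelled data.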
Pages: 941-955
Page count: 15