Improving Hate Speech Detection: The Impact of Semantic Representations and Preprocessing Techniques

Cited by: 0
Authors
Bolucu, Necva [1]
Ozerdem, Aysegul [1]
Affiliations
[1] CSIRO, DATA61, Sydney, Australia
Source
2023 31ST SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, SIU | 2023
Keywords
social media; hate speech; semantic; API; preprocessing;
DOI
10.1109/SIU59756.2023.10224051
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Social media is an important tool for measuring the pulse of a society. However, when it is used to produce hate speech targeting an individual or a group, it can give rise to social problems, which makes hate speech detection crucial. This study, proposed for the hate speech detection shared task at SIU 2023 NST, investigates how much semantic representations obtained through the OpenAI API contribute to effective hate speech detection. Three preprocessing steps are applied: normalization of the dataset, an emoji dictionary, and the SMOTE technique to address class imbalance in the dataset. To demonstrate the importance of these steps, two basic machine learning techniques are used: SVM and cosine similarity. The experimental results show that semantic representations combined with machine learning models offer a successful solution to the problem. In particular, the preprocessing step that addresses the imbalanced dataset contributes substantially to the results.
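The abstract names SMOTE oversampling and cosine similarity over semantic representations but does not give implementation details. The following is a minimal sketch under stated assumptions, not the authors' implementation: the three-dimensional vectors are toy stand-ins for embeddings returned by an embedding API, SMOTE is reduced to its core interpolation step, and cosine similarity is used as a nearest-class-centroid rule (the abstract does not say how it was applied). All function names (`smote_oversample`, `centroid`, `predict`) are hypothetical.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority vectors (simplified SMOTE).

    Each synthetic vector lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Assumes the minority class has at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def predict(vector, centroids):
    """Return the label whose centroid is most cosine-similar to vector."""
    return max(centroids, key=lambda label: cosine_similarity(vector, centroids[label]))


# Toy "embeddings" (assumption: real ones would come from an embedding API).
neutral = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.2],
           [0.8, 0.1, 0.1], [0.9, 0.0, 0.0], [1.0, 0.2, 0.1]]
hateful = [[0.1, 1.0, 0.9], [0.0, 0.9, 1.0], [0.2, 0.8, 0.9]]

# Oversample the minority class until both classes have the same size.
hateful_balanced = hateful + smote_oversample(hateful, len(neutral) - len(hateful))

centroids = {"neutral": centroid(neutral), "hate": centroid(hateful_balanced)}
prediction = predict([0.1, 0.9, 0.8], centroids)
```

In practice the balanced vectors could be passed to an SVM instead of the centroid rule; libraries such as imbalanced-learn and scikit-learn provide full implementations of SMOTE and SVM.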
Pages: 4