Token replacement-based data augmentation methods for hate speech detection

Authors
Kosisochukwu Judith Madukwe
Xiaoying Gao
Bing Xue
Affiliation
[1] Victoria University of Wellington, School of Engineering and Computer Science
Source
World Wide Web | 2022 / Vol. 25
Keywords
Hate speech data; Data augmentation; Token substitution; Word replacement; Data generation; Text data
DOI
Not available
Abstract
Hate speech detection relies mostly on text data. This data, usually sourced from social media platforms, is known to suffer from several issues that reduce its quality and, in turn, the quality of the trained models. Two such issues are a lack of diversity and a diminutive class of interest, which result in overfitted models that do not generalize well to other or newly collected data. These issues can be handled by augmenting the data with diverse samples, engineering non-redundant features, or designing robust classification models. This study focuses on data augmentation, a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to different textual data augmentation techniques. Specifically, we provide further insight into the token replacement method of textual data augmentation through empirical studies that investigate which embedding methods are a robust source of synonyms for the replacement process, which methods are effective for selecting the words to be replaced, and how to confirm that the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets known to be affected by a lack of diversity and a diminutive class of interest, significantly improve classification performance and provide insights into token replacement methods.
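To make the idea of token replacement concrete, the following is a minimal Python sketch of the general technique, not the authors' method. The synonym table `TOY_SYNONYMS`, the function names, and the `replace_prob` parameter are all hypothetical. Here a hand-written dictionary stands in for an embedding-based synonym source, words are selected at random rather than by any of the selection methods the paper studies, and the label is simply copied to each augmented sample, which assumes rather than verifies label preservation.

```python
import random

# Hypothetical synonym table. In practice, nearest neighbours from a word
# embedding model would likely play this role.
TOY_SYNONYMS = {
    "awful": ["terrible", "dreadful"],
    "people": ["folks", "persons"],
    "stupid": ["dumb", "foolish"],
}


def augment_by_token_replacement(text, synonyms, replace_prob=0.5, rng=None):
    """Return a copy of `text` with some tokens swapped for synonyms.

    Each whitespace-separated token that has an entry in `synonyms` is
    replaced with probability `replace_prob`; all other tokens are kept.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for token in text.split():
        candidates = synonyms.get(token.lower())
        if candidates and rng.random() < replace_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(token)
    return " ".join(out)


def augment_dataset(samples, synonyms, n_copies=2, seed=0):
    """Keep every (text, label) pair and add up to `n_copies` variants.

    The original label is copied unchanged (an assumption of this sketch).
    Variants identical to the original text are skipped.
    """
    rng = random.Random(seed)
    augmented = []
    for text, label in samples:
        augmented.append((text, label))
        for _ in range(n_copies):
            new_text = augment_by_token_replacement(text, synonyms, rng=rng)
            if new_text != text:
                augmented.append((new_text, label))
    return augmented
```

Calling `augment_dataset([("those people are awful", 1)], TOY_SYNONYMS)` returns the original pair followed by any variants that differ from it, each carrying label 1. Replacing the random choice with a word-importance score, or the dictionary with embedding neighbours, corresponds to the two design questions the abstract raises.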
Pages: 1129-1150
Number of pages: 21