Token replacement-based data augmentation methods for hate speech detection

Authors
Kosisochukwu Judith Madukwe
Xiaoying Gao
Bing Xue
Affiliations
[1] Victoria University of Wellington,School of Engineering and Computer Science
Source
World Wide Web | 2022 / Vol. 25
Keywords
Hate speech data; Data augmentation; Token substitution; Word replacement; Data generation; Text data;
Abstract
Hate speech detection relies mostly on text data. These data, usually sourced from social media platforms, are known to suffer from issues that reduce their quality and hence the quality of the models trained on them. Two such issues are a lack of diversity and a small class of interest in the dataset, which lead to overfitted models that do not generalize well to other or newly collected data. Ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features, and designing robust classification models. This study focuses on data augmentation, a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to different textual data augmentation techniques. Specifically, we provide further insight into the token replacement method of textual data augmentation through empirical studies that investigate which embedding methods are a robust source of synonyms for the replacement process, which methods are effective for selecting the words to be replaced, and how to confirm that the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets known to suffer from a lack of diversity and a small class of interest, significantly improve classification performance and provide insights into token replacement methods.
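The three questions raised in the abstract (where replacement candidates come from, which words get replaced, and whether the label survives) can be illustrated with a minimal sketch. This is not the authors' method: the toy vectors in EMBEDDINGS, the stopword list, the similarity threshold, and the function names nearest_neighbour, augment, and label_preserved are all assumptions made for illustration, and the word-selection rule (skip stopwords, replace with a fixed probability) is only one plausible choice.

```python
import math
import random

# Toy 3-d word vectors (assumption): stand-ins for a real embedding model.
EMBEDDINGS = {
    "awful": [0.9, 0.1, 0.0],
    "terrible": [0.85, 0.15, 0.05],
    "people": [0.1, 0.9, 0.1],
    "folks": [0.12, 0.88, 0.08],
    "great": [-0.8, 0.1, 0.2],
}
STOPWORDS = {"those", "are", "the", "a"}  # hypothetical, deliberately tiny


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbour(word, min_sim=0.9):
    """Return the most similar other word in EMBEDDINGS, or None if the
    word is unknown or no candidate reaches the similarity threshold."""
    if word not in EMBEDDINGS:
        return None
    best, best_sim = None, min_sim
    for cand, vec in EMBEDDINGS.items():
        if cand == word:
            continue
        sim = cosine(EMBEDDINGS[word], vec)
        if sim >= best_sim:
            best, best_sim = cand, sim
    return best


def augment(tokens, replace_prob=1.0, rng=None):
    """Replace each non-stopword token, with probability replace_prob,
    by its nearest embedding neighbour when one exists."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        sub = None
        if tok not in STOPWORDS and rng.random() < replace_prob:
            sub = nearest_neighbour(tok)
        out.append(sub if sub is not None else tok)
    return out


def label_preserved(classifier, original, augmented):
    """Crude label check: a reference classifier (any callable mapping
    tokens to a label) must agree on the original and augmented text."""
    return classifier(original) == classifier(augmented)


print(augment(["those", "people", "are", "awful"]))
# -> ['those', 'folks', 'are', 'terrible']
```

In practice the toy dictionary would be swapped for a trained embedding model, and the agreement check would use a classifier trained on the original data; both substitutions are left open here because the abstract does not specify them.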
Pages: 1129 - 1150
Number of pages: 21