Token replacement-based data augmentation methods for hate speech detection

Authors
Kosisochukwu Judith Madukwe
Xiaoying Gao
Bing Xue
Affiliation
[1] Victoria University of Wellington, School of Engineering and Computer Science
Source
World Wide Web | 2022 / Vol. 25
Keywords
Hate speech data; Data augmentation; Token substitution; Word replacement; Data generation; Text data;
DOI
Not available
Abstract
Hate speech detection relies mostly on text data. This data, usually sourced from social media platforms, is known to suffer from numerous issues that reduce its quality and hence the quality of the trained models. Among these issues are a lack of diversity and a small class of interest, which result in overfitted models that do not generalize well to other or newly collected data. Ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features, and designing robust classification models. In this study, the focus is on data augmentation. Data augmentation is a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to different textual data augmentation techniques. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding methods are a robust source of synonyms for the replacement process, which methods are effective for selecting the words to be replaced, and how to confirm that the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets known to suffer from a lack of diversity and a small class of interest, significantly improve classification performance and provide insights into token replacement methods.
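As a rough illustration of the token replacement idea summarized in the abstract, the sketch below swaps a token for its nearest neighbour in an embedding space. It is not the authors' method: the three-dimensional vectors in EMBEDDINGS, the function names (cosine, nearest_neighbour, augment), the similarity threshold, and the random choice of which tokens to replace are all assumptions made for illustration. The sketch also simply assumes that the augmented sentence keeps the label of the original, whereas the paper studies how to confirm that.

```python
import math
import random

# Toy 3-dimensional "embeddings" with made-up values (assumption, for
# illustration only). A real setup would load pretrained word vectors.
EMBEDDINGS = {
    "awful": (0.90, 0.10, 0.00),
    "terrible": (0.88, 0.12, 0.02),
    "people": (0.10, 0.90, 0.10),
    "folks": (0.12, 0.88, 0.10),
    "are": (0.00, 0.10, 0.90),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbour(token, min_similarity=0.95):
    """Return the most similar other vocabulary word, or None if no word
    reaches min_similarity or the token is out of vocabulary."""
    if token not in EMBEDDINGS:
        return None
    best, best_sim = None, min_similarity
    for word, vec in EMBEDDINGS.items():
        if word == token:
            continue
        sim = cosine(EMBEDDINGS[token], vec)
        if sim >= best_sim:
            best, best_sim = word, sim
    return best


def augment(text, replace_prob=0.5, rng=None):
    """Replace each token that has a close neighbour with probability
    replace_prob; all other tokens are copied unchanged."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so runs are repeatable
    out = []
    for token in text.split():
        candidate = nearest_neighbour(token.lower())
        if candidate is not None and rng.random() < replace_prob:
            out.append(candidate)
        else:
            out.append(token)
    return " ".join(out)


if __name__ == "__main__":
    print(augment("awful people are here", replace_prob=1.0))
```

Random selection is only the simplest possible policy for choosing which words to replace. The threshold in nearest_neighbour is a crude guard against replacements that drift too far from the original meaning.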
Pages: 1129-1150
Number of pages: 21