Token replacement-based data augmentation methods for hate speech detection

Cited by: 0

Authors:

Kosisochukwu Judith Madukwe

Xiaoying Gao

Bing Xue

Affiliation:

[1] Victoria University of Wellington, School of Engineering and Computer Science

Source:

World Wide Web | 2022 / Vol. 25

Keywords:

Hate speech data; Data augmentation; Token substitution; Word replacement; Data generation; Text data

DOI:

Not available

Abstract:

Hate speech detection mostly relies on text data. This data, usually sourced from social media platforms, is known to suffer from issues that reduce its quality and, in turn, the quality of the trained models. Two such issues are the lack of diversity and the small size of the class of interest in the dataset, which lead to overfitted models that do not generalize well to other or newly collected data. Ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features, or designing robust classification models. This study focuses on data augmentation, a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to different textual data augmentation techniques. Specifically, we provide further insight into the token replacement method of textual data augmentation through empirical studies that investigate which embedding methods are robust sources of synonyms for the replacement process, which methods are effective for selecting the words to be replaced, and how to confirm that the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets known to suffer from a lack of diversity and a small class of interest, significantly improve classification performance and provide insights into token replacement methods.
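
As a rough illustration of the token replacement idea described in the abstract, the Python sketch below swaps a token for its nearest neighbour in an embedding space when the cosine similarity is high enough. The embedding table, the similarity threshold, and the function names are all hypothetical and chosen only for this example. The paper's actual embedding sources, word selection strategies, and label preservation checks are not reproduced here.

```python
import math
import random

# Toy embedding table with made-up vectors (an assumption for illustration,
# not taken from the paper or from any trained embedding model).
EMBEDDINGS = {
    "awful": [0.9, 0.1, 0.0],
    "terrible": [0.88, 0.15, 0.02],
    "people": [0.1, 0.9, 0.1],
    "folks": [0.12, 0.88, 0.12],
    "great": [-0.9, 0.1, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbour(word, min_similarity=0.95):
    """Return the most similar other word, or None if none passes the threshold."""
    if word not in EMBEDDINGS:
        return None
    best, best_sim = None, min_similarity
    for candidate, vector in EMBEDDINGS.items():
        if candidate == word:
            continue
        sim = cosine(EMBEDDINGS[word], vector)
        if sim >= best_sim:
            best, best_sim = candidate, sim
    return best


def augment(tokens, replace_prob=0.5, rng=None):
    """Create one synthetic sample by replacing some tokens with close neighbours.

    Tokens are selected at random here; this is a placeholder for the
    word selection methods the abstract says the study compares.
    """
    rng = rng or random.Random(0)
    augmented = []
    for token in tokens:
        substitute = nearest_neighbour(token)
        if substitute is not None and rng.random() < replace_prob:
            augmented.append(substitute)
        else:
            augmented.append(token)
    return augmented


sample = ["those", "awful", "people"]
print(augment(sample, replace_prob=1.0))
```

In this sketch the synthetic sample simply inherits the label of the original text. Whether that assumption holds is the label preservation question the abstract raises, and checking it would need a separate step, for example scoring the augmented text with a classifier trained on the original data.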

Pages: 1129-1150

Page count: 21
