A two-stage balancing strategy based on data augmentation for imbalanced text sentiment classification

Cited by: 1
Authors
Pang, Zhicheng [1 ]
Li, Hong [1 ]
Wang, Chiyu [1 ]
Shi, Jiawen [1 ]
Zhou, Jiale [1 ]
Affiliations
[1] Cent South Univ, Sch Comp Sci & Engn, Changsha, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Imbalanced text sentiment classification; resampling; noise modification; data augmentation; word replacement; ATTENTION;
DOI
10.3233/JIFS-202716
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In practice, class imbalance is prevalent in sentiment classification tasks and degrades classifier performance. Recently, over-sampling strategies based on data augmentation techniques have attracted growing attention from researchers: they generate new samples by rewriting existing ones. However, the samples to be rewritten are usually selected at random, so uninformative samples may be chosen and then multiplied. Motivated by this observation, we propose a novel balancing strategy for text sentiment classification. Our approach is built on word replacement and consists of two stages, which not only balance the class distribution of the training set but also correct noisy data. In the first stage, we perform word replacement on specifically selected samples, rather than random ones, to obtain new samples. In the second stage, guided by noise detection, we revise the sentiment labels of noisy samples. To this end, we propose an improved term weighting scheme for imbalanced text datasets, called TF-IGM-CW, which helps extract the target samples to rewrite and the feature words. Experiments on four public sentiment datasets show that our method outperforms several other resampling methods and can easily be integrated with various classification algorithms.
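The abstract does not spell out the TF-IGM-CW formula, so the sketch below only illustrates the standard TF-IGM weighting of Chen et al. (2016) that the proposed scheme extends: each term's frequency is scaled by an Inverse Gravity Moment factor computed from its per-class frequencies. The function names, the λ default of 7, and the use of per-class document frequencies are illustrative assumptions, not the paper's exact TF-IGM-CW definition.

```python
import numpy as np

def igm_factor(class_term_counts, lam=7.0):
    """Global term factor 1 + lam * IGM(t), following Chen et al. (2016).

    class_term_counts: array of shape (n_classes, n_terms) with the
    per-class frequency of each term (e.g., document frequency per class).
    IGM(t) = f_1 / sum_r (f_r * r), where f_1 >= f_2 >= ... are the term's
    class frequencies sorted in descending order.
    """
    # Sort each term's class frequencies in descending order (largest first).
    sorted_counts = np.sort(class_term_counts, axis=0)[::-1]
    ranks = np.arange(1, sorted_counts.shape[0] + 1)[:, None]   # 1, 2, ..., m
    gravity = (sorted_counts * ranks).sum(axis=0)               # "gravity moment" denominator
    igm = np.divide(sorted_counts[0], gravity,
                    out=np.zeros(gravity.shape, dtype=float),
                    where=gravity > 0)
    return 1.0 + lam * igm

def tf_igm(tf_matrix, class_term_counts, lam=7.0):
    """Weight a term-frequency matrix of shape (n_docs, n_terms) by the IGM factor."""
    return tf_matrix * igm_factor(class_term_counts, lam)
```

In this background scheme, a term concentrated in one class gets an IGM close to 1 and thus a large weight, while a term spread evenly across classes gets a small IGM; the paper's TF-IGM-CW variant additionally adapts the weighting to imbalanced class sizes.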
Pages: 10073-10086
Number of pages: 14