TextCut: A Multi-region Replacement Data Augmentation Approach for Text Imbalance Classification

Cited by: 1
Authors
Jiang, Wanrong [1]
Chen, Ya [1]
Ri, Hao [1]
Liu, Guiquan [1]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei, Peoples R China
Source
NEURAL INFORMATION PROCESSING, ICONIP 2021, PT IV | 2021 / Vol. 13111
Keywords
Data imbalance; Data augmentation; Text classification
DOI
10.1007/978-3-030-92273-3_35
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In practical applications of text classification, data imbalance problems occur frequently, which typically leads to prejudice of a classifier against the majority group. Therefore, how to handle imbalanced text datasets to alleviate the skewed distribution is a crucial task. Existing mainstream methods tackle it by utilizing interpolation-based augmentation strategies to synthesize new texts from minority class texts. However, such interpolation may disrupt the syntactic and semantic information of the original texts, which makes it challenging to model the new texts. We propose a novel data augmentation method based on paired samples, called TextCut, to overcome this problem. For a minority class text and its paired text, TextCut samples multiple small square regions of the minority text in the hidden space and replaces them with the corresponding regions cut out from the paired text. We build TextCut upon the BERT model to better capture the features of minority class texts. We verify on three benchmark imbalanced text datasets that TextCut can further improve classification performance on both the minority categories and the entire set of categories, and effectively alleviate the imbalance problem.
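The region replacement described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each text is already encoded as a hidden-state matrix of shape (sequence length, hidden size), for example by BERT, and that both texts share that shape. The function name `textcut_sketch`, the parameters `num_regions` and `side`, and the uniform sampling of region positions are illustrative assumptions. The abstract does not say how labels are assigned to the augmented sample, so the sketch only reports the fraction of replaced entries.

```python
import numpy as np


def textcut_sketch(h_minority, h_paired, num_regions=4, side=8, rng=None):
    """Replace several random square regions of h_minority with h_paired.

    Hypothetical helper: both inputs are (seq_len, hidden_dim) arrays of
    hidden states. Returns the mixed array and the fraction of entries
    that were taken from h_paired.
    """
    if h_minority.shape != h_paired.shape:
        raise ValueError("both hidden-state matrices must have the same shape")
    rng = np.random.default_rng() if rng is None else rng

    seq_len, hidden_dim = h_minority.shape
    side = min(side, seq_len, hidden_dim)  # keep each square inside the matrix

    # Mark every entry that falls inside at least one sampled square.
    mask = np.zeros(h_minority.shape, dtype=bool)
    for _ in range(num_regions):
        top = rng.integers(0, seq_len - side + 1)      # upper bound is exclusive
        left = rng.integers(0, hidden_dim - side + 1)
        mask[top:top + side, left:left + side] = True

    # Copy so the caller's minority representation is left untouched.
    mixed = h_minority.copy()
    mixed[mask] = h_paired[mask]
    return mixed, float(mask.mean())


# Usage with random stand-ins for encoder outputs (assumed shapes).
demo_rng = np.random.default_rng(0)
h_a = demo_rng.normal(size=(32, 64))
h_b = demo_rng.normal(size=(32, 64))
augmented, replaced_fraction = textcut_sketch(h_a, h_b, rng=demo_rng)
```

Because sampled squares may overlap, the replaced fraction is at most `num_regions * side**2` divided by the matrix size, and can be smaller.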
Pages: 427-439
Page count: 13