A Submodular Optimization Framework for Imbalanced Text Classification With Data Augmentation

Cited by: 1
Authors
Alemayehu, Eyor [1]
Fang, Yi [1]
Affiliations
[1] Santa Clara Univ, Dept Comp Sci & Engn, Santa Clara, CA 95053 USA
Keywords
Data augmentation; Data models; Optimization; Predictive models; Perturbation methods; Task analysis; Greedy algorithms; text classification; imbalanced datasets; submodular optimization
DOI
10.1109/ACCESS.2023.3267669
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Imbalanced datasets are common in text classification, and the skewed label distribution of these datasets poses a major challenge to classifier performance. One popular way to mitigate this challenge is to augment underrepresented labels with synthesized items. The synthesized items are produced by data augmentation methods, which can typically generate an unbounded number of items. To select the synthesized items that maximize the performance of text classifiers, we introduce a novel method that selects items so as to jointly maximize the likelihood that each item belongs to its respective label and the diversity of the selected items. Our proposed method formulates the joint maximization as a monotone submodular objective function, whose solution can be approximated by a tractable and efficient greedy algorithm. We evaluated our method on multiple real-world datasets with different data augmentation techniques and text classifiers, and compared the results with several baselines. The experimental results demonstrate the effectiveness and efficiency of the proposed method.
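The abstract does not state the objective function itself, so the following is only a minimal sketch of the general technique it names: greedy maximization of a monotone submodular function under a cardinality budget. It assumes, hypothetically, that each synthesized item has a non-negative label-likelihood score (`relevance`) and that diversity is measured by a facility-location coverage term over a non-negative pairwise `similarity` matrix. The function name `greedy_select`, the weight `lam`, and the toy data are illustrative and are not taken from the paper.

```python
from typing import List, Sequence


def greedy_select(relevance: Sequence[float],
                  similarity: Sequence[Sequence[float]],
                  k: int,
                  lam: float = 1.0) -> List[int]:
    """Greedily pick up to k items maximizing an assumed objective:

        f(S) = sum_{j in S} relevance[j]
               + lam * sum_i max_{j in S} similarity[i][j]

    With non-negative inputs, f is monotone submodular, so the greedy
    solution is within a factor (1 - 1/e) of the best size-k subset.
    """
    n = len(relevance)
    selected: List[int] = []
    # coverage[i] = highest similarity between item i and any selected item
    coverage = [0.0] * n
    remaining = set(range(n))

    for _ in range(min(k, n)):
        best_j, best_gain = -1, float("-inf")
        for j in sorted(remaining):  # sorted for deterministic tie-breaking
            # Marginal diversity gain: how much j improves each item's coverage
            div_gain = sum(max(similarity[i][j] - coverage[i], 0.0)
                           for i in range(n))
            gain = relevance[j] + lam * div_gain
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        remaining.remove(best_j)
        coverage = [max(coverage[i], similarity[i][best_j]) for i in range(n)]

    return selected


# Toy data (hypothetical): items 0 and 1 are near-duplicates.
relevance = [0.9, 0.85, 0.6, 0.2]
similarity = [
    [1.0, 0.95, 0.1, 0.1],
    [0.95, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]

print(greedy_select(relevance, similarity, k=2))           # diversity-aware
print(greedy_select(relevance, similarity, k=2, lam=0.0))  # relevance only
```

On this toy input, the diversity-aware call skips the near-duplicate item 1 in favor of the dissimilar item 2, whereas setting `lam=0.0` reduces the selection to the two highest-scoring items.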
Pages: 41680-41696
Number of pages: 17