A new sampling method for classifying imbalanced data based on support vector machine ensemble

Cited by: 98
Authors
Jian, Chuanxia [1]
Gao, Jian [1]
Ao, Yinhui [1]
Affiliations
[1] Guangdong Univ Technol, Sch Electromech Engn, Key Lab Mech Equipment Mfg & Control Technol, Minist Educ, Guangzhou 510006, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Imbalanced data; Sampling; Support vector machine; Classification
DOI
10.1016/j.neucom.2016.02.006
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Because minority examples carry too little information to represent the inherent structure of a dataset, existing classification methods predict the minority class with low accuracy. Over- and under-sampling methods help to increase the prediction accuracy of the minority class. However, these two methods either lose important information or add trivial information for classification, which limits that accuracy. This paper therefore proposes a different contribution sampling (DCS) method based on the different contributions of the support vectors (SVs) and the non-support vectors (NSVs) to classification. DCS uses the biased support vector machine (B-SVM) to identify the SVs and the NSVs of an imbalanced dataset and applies a different sampling method to each: the synthetic minority over-sampling technique (SMOTE) re-samples the SVs in the minority class, and the random under-sampling technique (RUS) re-samples the NSVs in the majority class. Examples are labeled by an ensemble of support vector machines (SVMen). Experiments are carried out on imbalanced datasets selected from the UCI, AVU06a, Statlog, DP01a, JP98a and CWH03a repositories. The results show that, on these datasets, DCS performs better than the other methods in terms of the Receiver Operating Characteristic (ROC) curve. In terms of the geometric mean prediction accuracy (G_mean), DCS improves on NS, US, SMOTE and ROS by 20.80%, 5.97%, 8.66% and 9.35%, respectively. © 2016 Elsevier B.V. All rights reserved.
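The re-sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions. The support-vector indices are taken as given, as if a B-SVM had already been trained. Minority NSVs and majority SVs are kept unchanged, which the abstract does not specify. The interpolation is a simplified SMOTE-like routine. G_mean is taken to be the usual square root of sensitivity times specificity. All function and parameter names (`smote_like`, `random_under_sample`, `dcs_resample`, `g_mean`, `n_new`, `n_keep`) are hypothetical.

```python
import math
import random


def smote_like(points, n_new, k=2, rng=random):
    """Create n_new synthetic points by interpolating between a point and
    one of its k nearest neighbours (the core idea behind SMOTE)."""
    if len(points) < 2:
        return []
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # Other points, ordered by Euclidean distance to the base point.
        others = sorted((p for p in points if p is not base),
                        key=lambda p: math.dist(p, base))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


def random_under_sample(points, n_keep, rng=random):
    """Keep a random subset of at most n_keep points (RUS)."""
    return rng.sample(points, min(n_keep, len(points)))


def dcs_resample(minority, majority, min_sv_idx, maj_sv_idx,
                 n_new, n_keep, seed=0):
    """Over-sample minority SVs and under-sample majority NSVs.

    Assumption: minority NSVs and majority SVs are kept as they are.
    """
    rng = random.Random(seed)
    min_sv = [minority[i] for i in sorted(min_sv_idx)]
    maj_sv = [majority[i] for i in sorted(maj_sv_idx)]
    maj_nsv = [p for i, p in enumerate(majority) if i not in maj_sv_idx]
    new_minority = list(minority) + smote_like(min_sv, n_new, rng=rng)
    new_majority = maj_sv + random_under_sample(maj_nsv, n_keep, rng=rng)
    return new_minority, new_majority


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)
```

In practice the index sets would come from a trained classifier, and the re-sampled data would then be used to train the members of the SVM ensemble.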
Pages: 115-122
Page count: 8