SVMs Modeling for Highly Imbalanced Classification

Cited by: 640
Authors
Tang, Yuchun [1 ]
Zhang, Yan-Qing [2 ]
Chawla, Nitesh V. [3 ]
Krasser, Sven [1 ]
Affiliations
[1] McAfee Inc, Alpharetta, GA 30022 USA
[2] Georgia State Univ, Dept Comp Sci, Atlanta, GA 30302 USA
[3] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
Source
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART B: CYBERNETICS | 2009, Vol. 39, No. 1
Keywords
Computational intelligence; cost-sensitive learning; granular computing; highly imbalanced classification; oversampling; support vector machines (SVMs); undersampling;
DOI
10.1109/TSMCB.2008.2002909
CLC number
TP [Automation technology; computer technology]
Discipline code
0812
Abstract
Traditional classification algorithms can be limited in their performance on highly imbalanced data sets. A popular stream of work for countering the problem of class imbalance has been the application of a variety of sampling strategies. In this correspondence, we focus on designing modifications to support vector machines (SVMs) to appropriately tackle the problem of class imbalance. We incorporate different "rebalance" heuristics in SVM modeling, including cost-sensitive learning, and over- and undersampling. These SVM-based strategies are compared with various state-of-the-art approaches on a variety of data sets by using various metrics, including G-mean, area under the receiver operating characteristic curve, F-measure, and area under the precision/recall curve. We show that we are able to surpass or match the previously known best algorithms on each data set. In particular, of the four SVM variations considered in this correspondence, the novel granular SVMs-repetitive undersampling algorithm (GSVM-RU) is the best in terms of both effectiveness and efficiency. GSVM-RU is effective, as it can minimize the negative effect of information loss while maximizing the positive effect of data cleaning in the undersampling process. GSVM-RU is efficient, as it extracts far fewer support vectors and hence greatly speeds up SVM prediction.
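The repetitive undersampling idea behind GSVM-RU can be sketched as follows: repeatedly train an SVM, extract the negative support vectors as an "information granule," remove them from the remaining negatives, and finally train on all positives plus the collected granules. This is a minimal illustration only, assuming scikit-learn's `SVC` as the base learner and synthetic data; the paper's exact procedure and parameters may differ.

```python
# Hedged sketch of GSVM-RU-style repetitive undersampling.
# Assumptions: scikit-learn's SVC as a stand-in base learner,
# a fixed number of granule-extraction rounds, synthetic 2-D data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Highly imbalanced synthetic data: 500 negatives, 25 positives.
X_neg = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_pos = rng.normal(loc=2.5, scale=1.0, size=(25, 2))
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 500 + [1] * 25)

def gsvm_ru(X, y, n_granules=3):
    """Collect negative support vectors ("negative granules") over several
    rounds, then train a final SVM on all positives plus the granules."""
    pos_idx = np.where(y == 1)[0]
    remaining_neg = np.where(y == 0)[0]
    kept_neg = []
    for _ in range(n_granules):
        idx = np.concatenate([pos_idx, remaining_neg])
        clf = SVC(kernel="linear").fit(X[idx], y[idx])
        sv = idx[clf.support_]          # map support-vector indices back
        neg_sv = sv[y[sv] == 0]         # keep only the negative granule
        if len(neg_sv) == 0:
            break
        kept_neg.append(neg_sv)
        # Remove this granule from the negatives and repeat.
        remaining_neg = np.setdiff1d(remaining_neg, neg_sv)
    final_idx = np.concatenate([pos_idx] + kept_neg)
    return SVC(kernel="linear").fit(X[final_idx], y[final_idx])

model = gsvm_ru(X, y)
```

Because the final model is trained only on the positives and a few informative negative granules, it needs far fewer support vectors than an SVM trained on the full imbalanced set, which is the efficiency argument the abstract makes.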
Pages: 281-288
Page count: 8