Boosting support vector machines for imbalanced data sets

Cited by: 0
Authors
Benjamin X. Wang
Nathalie Japkowicz
Affiliations
[1] Datalong Technology Ltd.
[2] School of Information Technology and Engineering, University of Ottawa
Source
Knowledge and Information Systems | 2010 / Vol. 25
Keywords
Imbalanced data sets; Support vector machines; Boosting
DOI
Not available
Abstract
Real-world data mining applications must address the issue of learning from imbalanced data sets. The problem occurs when the instances of one class greatly outnumber those of the other class. Such data sets often cause a default classifier to be built due to skewed vector spaces or lack of information. Common approaches for dealing with the class imbalance problem involve modifying the data distribution or modifying the classifier. In this work, we combine both approaches. We use support vector machines with soft margins as the base classifier to solve the skewed vector spaces problem. We then counter the excessive bias introduced by this approach with a boosting algorithm. We found that this ensemble of SVMs makes an impressive improvement in prediction performance, not only for the majority class, but also for the minority class.
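The general idea of boosting soft-margin SVMs can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes plain discrete AdaBoost, a linear SVM trained by subgradient descent on a weighted hinge loss, and synthetic toy data. All function names, parameters, and data below are hypothetical and chosen only for illustration.

```python
import math
import random


def train_linear_svm(X, y, sample_w, C=1.0, epochs=200, lr=0.05):
    """Weighted soft-margin linear SVM (illustrative, not from the paper).

    Minimises 0.5*||w||^2 + C * sum_i s_i * max(0, 1 - y_i*(w.x_i + b))
    by plain subgradient descent.
    """
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0  # gradient of the regulariser term
        for xi, yi, si in zip(X, y, sample_w):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for j in range(d):
                    gw[j] -= C * si * yi * xi[j]
                gb -= C * si * yi
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b


def svm_predict(model, x):
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def boost_svms(X, y, rounds=5, C=1.0):
    """Discrete AdaBoost with the weighted SVM above as base learner."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        model = train_linear_svm(X, y, weights, C=C)
        preds = [svm_predict(model, xi) for xi in X]
        err = sum(wi for wi, p, yi in zip(weights, preds, y) if p != yi)
        if err >= 0.5:  # base learner no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - max(err, 1e-10)) / max(err, 1e-10))
        ensemble.append((alpha, model))
        if err == 0:  # nothing left to reweight
            break
        # Increase the weight of misclassified points, decrease the rest.
        weights = [wi * math.exp(-alpha * yi * p)
                   for wi, p, yi in zip(weights, preds, y)]
        total = sum(weights)
        weights = [wi / total for wi in weights]
    return ensemble


def ensemble_predict(ensemble, x):
    score = sum(alpha * svm_predict(m, x) for alpha, m in ensemble)
    return 1 if score >= 0 else -1


# Synthetic imbalanced toy data: 50 majority (-1) vs 8 minority (+1) points.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(50)]
X += [[rng.gauss(2.5, 1.0), rng.gauss(2.5, 1.0)] for _ in range(8)]
y = [-1] * 50 + [1] * 8

ensemble = boost_svms(X, y, rounds=5)
preds = [ensemble_predict(ensemble, xi) for xi in X]
minority_hits = sum(1 for p, yi in zip(preds, y) if yi == 1 and p == 1)
print(f"models: {len(ensemble)}, minority recalled: {minority_hits}/8")
```

In this sketch the boosting loop passes its per-instance weights directly into the SVM's hinge loss, so instances that earlier models misclassified count more in later rounds. With imbalanced data those are often minority instances. A more faithful reproduction would need the base-classifier settings and boosting variant described in the paper itself.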
Pages: 1-20
Page count: 19