Preprocessing unbalanced data using support vector machine

Cited by: 139
Authors
Farquad, M. A. H. [2]
Bose, Indranil [1]
Affiliations
[1] Indian Inst Management Calcutta, Kolkata 700104, India
[2] Univ Hong Kong, Sch Business, Hong Kong, Hong Kong, Peoples R China
Keywords
Hybrid method; Preprocessor; SVM; Unbalanced data; COIL data; Imbalanced data; Classification; Fraud; Prediction; Selection
DOI
10.1016/j.dss.2012.01.016
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper applies the support vector machine (SVM) to the class imbalance problem, with the objective of examining the feasibility and efficiency of SVM as a preprocessor. Our study analyzes different classification algorithms used to predict which customers hold a caravan insurance policy, based on their sociodemographic data and history of product ownership. A series of experiments was conducted to test several computational intelligence techniques, namely Multilayer Perceptron (MLP), Logistic Regression (LR), and Random Forest (RF). Standard balancing techniques such as under-sampling, over-sampling, and Synthetic Minority Over-sampling TEchnique (SMOTE) are also employed. Subsequently, a data-balancing strategy for handling imbalanced class distributions is proposed. The proposed approach first trains an SVM as a preprocessor and then replaces the actual target values of the training data with the predictions of the trained SVM. This modified training data is then used to train techniques such as MLP, LR, and RF. Based on the measure of sensitivity, it is observed that the proposed approach not only balances the data effectively but also provides more instances of the minority class, which in turn enhances the performance of the intelligence techniques. (C) 2012 Elsevier B.V. All rights reserved.
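The relabelling step described in the abstract (train an SVM, overwrite the training targets with its predictions, then fit a second learner on the modified data) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the synthetic two-dimensional data, the Pegasos-style linear SVM, the `pos_weight` minority cost, and all function names are assumptions introduced here, and logistic regression stands in for the MLP, LR, and RF learners named in the abstract.

```python
import math
import random


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def make_data(n_major, n_minor, seed):
    """Synthetic 2-D toy data (NOT the COIL data); last feature is a constant bias term."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1), rng.gauss(0, 1), 1.0] for _ in range(n_major)]
    X += [[rng.gauss(2, 1), rng.gauss(2, 1), 1.0] for _ in range(n_minor)]
    return X, [0] * n_major + [1] * n_minor


def train_linear_svm(X, y, epochs=30, lam=0.01, pos_weight=1.0, seed=0):
    """Pegasos-style stochastic subgradient descent on the hinge loss.

    Labels are 0/1. pos_weight is an assumption of this sketch: it scales
    updates for minority (label 1) samples and is not taken from the abstract.
    """
    rng = random.Random(seed)
    w, t = [0.0] * len(X[0]), 0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            sign = 1.0 if y[i] == 1 else -1.0
            cost = pos_weight if y[i] == 1 else 1.0
            violated = sign * dot(w, X[i]) < 1.0  # margin check before the update
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularisation shrink
            if violated:
                w = [wj + eta * cost * sign * xj for wj, xj in zip(w, X[i])]
    return w


def train_logistic_regression(X, y, epochs=100, lr=0.05):
    """Plain stochastic gradient ascent on the log-likelihood."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, target in zip(X, y):
            z = dot(w, x)
            # Numerically stable sigmoid
            p = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
            w = [wj + lr * (target - p) * xj for wj, xj in zip(w, x)]
    return w


def predict_linear(w, X):
    """Class 1 when the linear score is non-negative, else class 0."""
    return [1 if dot(w, x) >= 0.0 else 0 for x in X]


def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


X_train, y_train = make_data(190, 10, seed=1)
X_test, y_test = make_data(95, 5, seed=2)

# Step 1: train the SVM preprocessor on the original, imbalanced targets.
svm_w = train_linear_svm(X_train, y_train, pos_weight=19.0)
# Step 2: replace the actual training targets with the SVM's predictions.
y_relabelled = predict_linear(svm_w, X_train)
# Step 3: train the downstream learner on the modified training data.
lr_w = train_logistic_regression(X_train, y_relabelled)
y_pred = predict_linear(lr_w, X_test)

print("minority count before/after relabelling:", sum(y_train), sum(y_relabelled))
print("test sensitivity:", sensitivity(y_test, y_pred))
```

Sensitivity is computed against the true test labels, in line with the evaluation measure named in the abstract. How many minority instances the relabelling produces depends on how the SVM is configured; the cost weighting used here is only one possible choice and the abstract does not specify it.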
Pages: 226-233
Number of pages: 8