Parallel selective sampling method for imbalanced and large data classification

Cited by: 41
Authors
D'Addabbo, Annarita [1]
Maglietta, Rosalia [1]
Affiliations
[1] CNR, Inst Intelligent Syst Automat, I-70126 Bari, Italy
Keywords
Imbalanced learning; Classification; Support vector machine; Selective sampling methods; SUPPORT VECTOR MACHINES; SVMS;
DOI
10.1016/j.patrec.2015.05.008
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Several applications aim to identify rare events from very large data sets. Classification algorithms may present great limitations on large data sets and show a performance degradation due to class imbalance. Many solutions have been presented in literature to deal with the problem of huge amount of data or imbalancing separately. In this paper we assessed the performances of a novel method, Parallel Selective Sampling (PSS), able to select data from the majority class to reduce imbalance in large data sets. PSS was combined with the Support Vector Machine (SVM) classification. PSS-SVM showed excellent performances on synthetic data sets, much better than SVM. Moreover, we showed that on real data sets PSS-SVM classifiers had performances slightly better than those of SVM and RUSBoost classifiers with reduced processing times. In fact, the proposed strategy was conceived and designed for parallel and distributed computing. In conclusion, PSS-SVM is a valuable alternative to SVM and RUSBoost for the problem of classification by huge and imbalanced data, due to its accurate statistical predictions and low computational complexity. (C) 2015 The Authors. Published by Elsevier B.V.
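The abstract does not state the rule PSS uses to select majority-class examples, so the following is only a minimal sketch of the general idea it describes: reducing class imbalance by thinning the majority class in independent chunks that could be processed in parallel. Plain random undersampling stands in for the actual PSS selection step, and all names (`undersample_chunk`, `parallel_undersample`, `keep_ratio`) are hypothetical, not taken from the paper.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def undersample_chunk(chunk, majority_label, keep_ratio, seed):
    """Keep every minority example and a random fraction of majority examples.

    NOTE: random selection is an assumption used for illustration;
    it is not the selection criterion of PSS.
    """
    rng = random.Random(seed)
    kept = []
    for features, label in chunk:
        if label != majority_label or rng.random() < keep_ratio:
            kept.append((features, label))
    return kept


def parallel_undersample(data, majority_label, keep_ratio, n_chunks=4, seed=0):
    """Split the data into chunks and undersample each chunk independently."""
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = pool.map(
            undersample_chunk,
            chunks,
            [majority_label] * n_chunks,
            [keep_ratio] * n_chunks,
            [seed + i for i in range(n_chunks)],  # one seed per chunk
        )
        # Flatten the per-chunk results into a single reduced data set.
        return [example for part in parts for example in part]


if __name__ == "__main__":
    # Toy data: 1000 majority examples (label 0), 20 minority examples (label 1).
    toy = [((i,), 0) for i in range(1000)] + [((i,), 1) for i in range(20)]
    reduced = parallel_undersample(toy, majority_label=0, keep_ratio=0.1)
    n_minority = sum(1 for _, label in reduced if label == 1)
    print(len(reduced), n_minority)
```

Because each chunk is processed without reference to the others, the same function could be dispatched to separate processes or machines. A thread pool is used here only to keep the sketch short. The reduced set would then be passed to a classifier such as an SVM.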
Pages: 61-67
Number of pages: 7
Related papers
36 in total
[1]   Applying support vector machines to imbalanced datasets [J].
Akbani, R ;
Kwek, S ;
Japkowicz, N .
MACHINE LEARNING: ECML 2004, PROCEEDINGS, 2004, 3201 :39-50
[2]   Supervised algorithms for particle classification by a transition radiation detector [J].
Ambriola, M ;
Bellotti, R ;
Circella, M ;
Maglietta, R ;
Stramaglia, S .
NUCLEAR INSTRUMENTS & METHODS IN PHYSICS RESEARCH SECTION A-ACCELERATORS SPECTROMETERS DETECTORS AND ASSOCIATED EQUIPMENT, 2003, 510 (03) :362-370
[3]  
Ancona N., 2004, MACH LEARN APPL P, V11, P129
[4]  
Blake C. L., 1998, UCI repository of machine learning databases
[5]   Selection of relevant features and examples in machine learning [J].
Blum, AL ;
Langley, P .
ARTIFICIAL INTELLIGENCE, 1997, 97 (1-2) :245-271
[6]   Random forests [J].
Breiman, L .
MACHINE LEARNING, 2001, 45 (01) :5-32
[7]   Classification of imbalanced remote-sensing data by neural networks [J].
Bruzzone, L ;
Serpico, SB .
PATTERN RECOGNITION LETTERS, 1997, 18 (11-13) :1323-1328
[8]   Support vector machine classification for large data sets via minimum enclosing ball clustering [J].
Cervantes, Jair ;
Li, Xiaoou ;
Yu, Wen ;
Li, Kang .
NEUROCOMPUTING, 2008, 71 (4-6) :611-619
[9]   SMOTE: Synthetic minority over-sampling technique [J].
Chawla, Nitesh V. ;
Bowyer, Kevin W. ;
Hall, Lawrence O. ;
Kegelmeyer, W. Philip .
2002, American Association for Artificial Intelligence (16)
[10]  
Evgeniou T, 2002, LECT NOTES ARTIF INT, V2308, P346