A Distance-Based Weighted Undersampling Scheme for Support Vector Machines and its Application to Imbalanced Classification

Cited by: 120
Authors
Kang, Qi [1]
Shi, Lei [1]
Zhou, MengChu [2, 3]
Wang, XueSong [1]
Wu, Qidi [1]
Wei, Zhi [4]
Affiliations
[1] Tongji Univ, Sch Elect & Informat Engn, Dept Control Sci & Engn, Shanghai 201804, Peoples R China
[2] Macau Univ Sci & Technol, Inst Syst Engn, Macau 999078, Peoples R China
[3] New Jersey Inst Technol, Dept Elect & Comp Engn, Newark, NJ 07102 USA
[4] New Jersey Inst Technol, Dept Comp Sci, Newark, NJ 07102 USA
Keywords
Class imbalance; data distribution; Euclidean distance; support vector machine (SVM); undersampling; ENSEMBLE;
DOI
10.1109/TNNLS.2017.2755595
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
The support vector machine (SVM) plays a prominent role in classical machine learning, especially in classification and regression. Because it is built on structural risk minimization, it has a good reputation for reducing overfitting, avoiding the curse of dimensionality, and not getting trapped in local minima. Nevertheless, existing SVMs do not perform well on class-imbalanced and large-scale data. Undersampling is a plausible way to address imbalanced problems, but it suffers from high computational complexity and reduced accuracy because of its many iterations and its random sampling process. To improve SVM classification performance on imbalanced data, this work proposes a weighted undersampling (WU) scheme for SVM based on geometric distance in the sample space, yielding an improved algorithm named WU-SVM. In WU-SVM, majority samples are grouped into subregions (SRs) and assigned different weights according to their Euclidean distance to the hyperplane. Samples in an SR with a higher weight are more likely to be sampled and used in each learning iteration, so that the data distribution information of the original data set is retained as much as possible. Comprehensive experiments test WU-SVM on 21 binary-class and six multiclass publicly available data sets. The results show that it outperforms state-of-the-art methods on three popular metrics for imbalanced classification: area under the curve (AUC), F-measure, and G-mean.
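The sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' WU-SVM implementation: the abstract does not give the grouping rule or the weight formula, so the code below assumes a known linear hyperplane `w·x + b = 0`, equal-size subregions formed by distance rank, and a subregion weight equal to the inverse of its mean distance (closer subregions are favoured). The names `hyperplane_distance` and `weighted_undersample` are hypothetical.

```python
import math
import random


def hyperplane_distance(x, w, b):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def weighted_undersample(majority, w, b, n_regions, n_keep, seed=0):
    """Return sorted indices of n_keep majority samples drawn by subregion weight.

    Assumptions (not taken from the paper): subregions are equal-size slices
    of the distance ranking, and a subregion's weight is 1 / (mean distance).
    """
    rng = random.Random(seed)
    dist = [hyperplane_distance(x, w, b) for x in majority]

    # Rank samples by distance and cut the ranking into subregions (SRs).
    order = sorted(range(len(majority)), key=lambda i: dist[i])
    size = math.ceil(len(order) / n_regions)
    regions = [order[k:k + size] for k in range(0, len(order), size)]

    # Assumed weighting: subregions nearer the hyperplane get larger weights.
    weights = [1.0 / (sum(dist[i] for i in r) / len(r) + 1e-12) for r in regions]

    # Every sample inherits the weight of its subregion.
    pool = [(i, weights[k]) for k, r in enumerate(regions) for i in r]

    # Weighted sampling without replacement.
    chosen = []
    while pool and len(chosen) < n_keep:
        pick = rng.choices(range(len(pool)), weights=[wt for _, wt in pool])[0]
        chosen.append(pool.pop(pick)[0])
    return sorted(chosen)


if __name__ == "__main__":
    # Toy majority class: 20 points at increasing distance from the line x1 = 0.
    majority = [[float(i), 0.0] for i in range(1, 21)]
    print(weighted_undersample(majority, w=[1.0, 0.0], b=0.0, n_regions=4, n_keep=8))
```

In an iterative scheme, the hyperplane would presumably be re-estimated after each round and the weights recomputed; the sketch covers a single sampling round only.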
Pages: 4152-4165
Number of pages: 14