A New Sampling Approach for Classification of Imbalanced Data sets with High Density

Cited by: 0
Authors
Jia Pengfei [1]
Zhang Chunkai [1]
He Zhenyu [1]
Affiliation
[1] Harbin Inst Technol, Shenzhen Grad Sch, Shenzhen, Peoples R China
Source
2014 INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP) | 2014
Keywords
imbalanced data; classification; high density; big data; sampling method; SMOTE;
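DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract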
Class imbalance is a common problem in machine learning. Because traditional classification algorithms are designed for balanced data, they tend to perform poorly on imbalanced classification tasks, especially when the imbalanced data has very high density. This paper first introduces the importance of imbalanced data classification in various fields, then discusses existing methods for solving the problem, and finally proposes two new sampling methods, based on Borderline-SMOTE, for imbalanced data with high density, in particular big data with this kind of distribution. The two algorithms not only over-sample the minority samples near the borderline, but also create appropriate synthetic samples on the majority-class side and under-sample certain majority-class samples. Experiments show that, when the sampling rate brings the majority and minority classes to approximate balance, both algorithms can achieve better performance than random over-sampling, SMOTE (Synthetic Minority Over-sampling Technique), and Borderline-SMOTE, as measured by AUC (Area Under the Receiver Operating Characteristic Curve).
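For orientation, the baseline that the abstract builds on can be sketched as follows. This is a minimal illustration of the general Borderline-SMOTE idea (over-sample only the minority points whose neighbourhoods are dominated by the majority class), not the two methods proposed in the paper, whose details the abstract does not give. The function name `borderline_smote`, its parameters, and the toy data are all assumptions made for this sketch.

```python
import numpy as np

def borderline_smote(X_min, X_maj, n_new, k=5, seed=0):
    """Illustrative Borderline-SMOTE-style over-sampling (hypothetical helper).

    Assumes X_min holds at least two minority samples.
    Returns an array of up to n_new synthetic minority samples.
    """
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])
    # Boolean mask: True where the stacked row belongs to the majority class.
    is_maj = np.r_[np.zeros(len(X_min), bool), np.ones(len(X_maj), bool)]

    # Step 1: find "danger" minority points, i.e. those whose k nearest
    # neighbours in the whole data set are at least half, but not all, majority.
    danger = []
    for i, x in enumerate(X_min):
        dist = np.linalg.norm(X_all - x, axis=1)
        dist[i] = np.inf  # exclude the point itself
        n_maj = is_maj[np.argsort(dist)[:k]].sum()
        if k / 2 <= n_maj < k:
            danger.append(i)
    if not danger:
        return np.empty((0, X_min.shape[1]))

    # Step 2: SMOTE interpolation, seeded only from the danger points.
    k_min = min(k, len(X_min) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(danger)
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        dist[i] = np.inf
        j = rng.choice(np.argsort(dist)[:k_min])  # random minority neighbour
        gap = rng.random()                        # position along the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic).reshape(-1, X_min.shape[1])

# Toy data (made up): two minority points sit next to the majority cluster.
X_min = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [2.0, 0.0], [1.95, 0.0]])
X_maj = np.array([[2.1, 0.0], [2.0, 0.1], [3.0, 0.0], [3.0, 1.0], [4.0, 0.0]])
new_points = borderline_smote(X_min, X_maj, n_new=10, k=3)
```

In this toy set only the two minority points near the majority cluster qualify as seeds, so every synthetic sample lies on a segment that starts at one of them. The proposed methods additionally place synthetic samples on the majority-class side and under-sample selected majority samples, which this sketch does not attempt.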
Pages: 217-222
Number of pages: 6