Undersampled K-means approach for handling imbalanced distributed data

Cited by: 0
Authors
Kumar, N. Santhosh [1]
Rao, K. Nageswara [2]
Govardhan, A. [3, 4]
Reddy, K. Sudheer [5]
Mahmood, Ali Mirza [6]
Affiliations
[1] JNTU, Dept CSE, Hyderabad, Andhra Pradesh, India
[2] PSCMR Coll Engn & Technol, Vijayawada, Andhra Pradesh, India
[3] CSE, Hyderabad, Andhra Pradesh, India
[4] JNTU, SIT, Hyderabad, Andhra Pradesh, India
[5] Infosys, Hyderabad, Andhra Pradesh, India
[6] DMS SVH Coll Engn, Machilipatnam, Andhra Pradesh, India
Keywords
Imbalanced data; K-means clustering algorithms; Undersampling; USKM
DOI
10.1007/s13748-014-0045-6
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
K-means is a well-known partitional clustering technique, widely used for its low computational cost. However, the performance of the K-means algorithm tends to be affected by skewed data distributions, i.e., imbalanced data. K-means often produces clusters of relatively uniform size even when the clusters in the input data vary in size, a behavior called the "uniform effect". In this paper, we analyze the causes of this effect and illustrate that it is more likely to occur in the K-means clustering process. As the minority class decreases in size, the uniform effect becomes more evident. To counter the uniform effect, we revisit the well-known K-means algorithm and provide a general method for properly clustering imbalanced distributed data. The proposed algorithm uses a novel undersampling technique that intelligently removes noisy and weak instances from the majority class. We conduct experiments on twelve UCI datasets from various application domains, with five algorithms used for comparison on eight evaluation metrics. Experimental results show the effectiveness of the proposed clustering algorithm in clustering both balanced and imbalanced data.
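The uniform effect described in the abstract can be reproduced on a toy example. The sketch below is not the paper's USKM algorithm or one of its experiments. It is a plain Lloyd's K-means on hypothetical 1-D data, with an assumed majority class of 1000 evenly spaced points on [0, 10] and an assumed minority class of 20 points on [11, 12]. The function name `kmeans_1d` and all data values are illustrative assumptions. Even when the centroids start at the true class means, the iteration drifts toward two clusters of roughly equal size.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain Lloyd's K-means on 1-D data; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
            for x in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels


# Hypothetical imbalanced data (assumed for illustration only).
majority = [10 * i / 999 for i in range(1000)]  # 1000 points on [0, 10]
minority = [11 + i / 19 for i in range(20)]     # 20 points on [11, 12]
points = majority + minority

# Start from the true class means to give K-means the best possible chance.
init = [sum(majority) / len(majority), sum(minority) / len(minority)]
centroids, labels = kmeans_1d(points, init)

sizes = [labels.count(0), labels.count(1)]
print("true class sizes:", [len(majority), len(minority)])
print("K-means cluster sizes:", sizes)
```

In this setup the minority points end up sharing a cluster with a large part of the majority class, because splitting the wide majority class lowers the sum of squared errors far more than isolating the small minority class does. This is the behavior that undersampling the majority class is meant to mitigate.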
Pages: 29-38
Number of pages: 10