Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads

Cited by: 0
Authors
Mark Ming-Tso Chiang
Boris Mirkin
Affiliations
[1] Birkbeck University of London, Department of Computer Science & Information Systems
[2] State University - Higher School of Economics
Source
Journal of Classification | 2010 / Vol. 27
Keywords
K-Means clustering; Number of clusters; Anomalous pattern; Hartigan’s rule; Gap statistic
DOI
Not available
Abstract
The issue of determining “the right number of clusters” in K-Means has attracted considerable interest, especially in recent years. Cluster intermix appears to be the factor that most affects clustering results. This paper proposes an experimental setting for comparing different approaches on data generated from Gaussian clusters, with controlled parameters of between- and within-cluster spread to model cluster intermix. The setting allows centroid recovery to be evaluated on a par with the conventional evaluation of cluster recovery. The subjects of our interest are two versions of the “intelligent” K-Means method, iK-Means, which find the “right” number of clusters by extracting “anomalous patterns” from the data one by one. We compare them with seven other methods, including Hartigan’s rule, averaged Silhouette width and the Gap statistic, under different between- and within-cluster spread-shape conditions. The results of our experiments show several consistent patterns; for example, Hartigan’s rule reproduces the right K best, but not the clusters or their centroids. This leads us to propose an adjusted version of iK-Means, which performs well in the current experimental setting.
Pages: 3-40
Page count: 37
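
Hartigan’s rule, one of the methods named in the abstract, can be illustrated with a small sketch. The form used here is the commonly stated one and is not taken from this record: compute H_K = (W_K / W_{K+1} - 1)(N - K - 1), where W_K is the within-cluster sum of squares for K clusters and N is the number of points, and stop at the first K with H_K <= 10. The function names `kmeans_wss` and `hartigan_k`, the farthest-point initialisation, and the toy data are all illustrative assumptions rather than the authors' experimental setup.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wss(points, k, max_iter=100):
    """Run a basic K-Means and return the within-cluster sum of squares W_k."""
    # Deterministic farthest-point initialisation (an illustrative choice,
    # not the initialisation studied in the paper).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: each centre moves to the mean of its group.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def hartigan_k(points, k_max=10, threshold=10.0):
    """Return the first K whose Hartigan index H_K is at most `threshold`."""
    n = len(points)
    w_k = kmeans_wss(points, 1)
    for k in range(1, k_max + 1):
        w_next = kmeans_wss(points, k + 1)
        if w_next == 0:
            # K + 1 clusters fit the data exactly; no further split can help.
            return k + 1
        h = (w_k / w_next - 1.0) * (n - k - 1)
        if h <= threshold:
            return k
        w_k = w_next
    return k_max


# Toy data: three tight, well-separated groups of four points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
points = [(cx + dx, dy) for cx in (0.0, 10.0, 20.0) for dx, dy in offsets]

print(hartigan_k(points))
```

On data this cleanly separated the rule has an easy job; the abstract's point is that behaviour changes once clusters intermix, which this toy example does not attempt to reproduce.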