The next-generation K-means algorithm

Cited by: 19
Author
Demidenko, Eugene [1, 2]
Affiliations
[1] Dartmouth Coll, Dept Biomed Data Sci, Hanover, NH 03755 USA
[2] Dartmouth Coll, Dept Math, Hanover, NH 03755 USA
Keywords
clusterwise regression; hard classification; K-medians; maximum likelihood; multilevel data; robust clustering; SigClust; STATISTICAL SIGNIFICANCE; LINEAR-REGRESSION; CLUSTER-ANALYSIS; DATA SET; MIXTURE; NUMBER
DOI
10.1002/sam.11379
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
Typically, when referring to a model-based classification, the mixture distribution approach is understood. In contrast, we revive the hard-classification model-based approach developed by Banfield and Raftery (1993) for which K-means is equivalent to the maximum likelihood (ML) estimation. The next-generation K-means algorithm does not end after the classification is achieved, but moves forward to answer the following fundamental questions: Are there clusters, how many clusters are there, what are the statistical properties of the estimated means and index sets, what is the distribution of the coefficients in the clusterwise regression, and how to classify multilevel data? The statistical model-based approach for the K-means algorithm is the key, because it allows statistical simulations and studying the properties of classification following the track of the classical statistics. This paper illustrates the application of the ML classification to testing the no-clusters hypothesis, to studying various methods for selection of the number of clusters using simulations, robust clustering using Laplace distribution, studying properties of the coefficients in clusterwise regression, and finally to multilevel data by marrying the variance components model with K-means.
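The equivalence between K-means and hard-classification ML mentioned in the abstract can be illustrated with a minimal sketch. It assumes spherical Gaussian clusters with a common variance, and it is not the paper's implementation: the function names `kmeans_hard_ml` and `profile_loglik`, the simulated data, and all parameter values are hypothetical. Under that assumption, once the variance is replaced by its ML estimate, the classification log-likelihood is a decreasing function of the within-cluster sum of squares (WSS), so each Lloyd iteration that lowers the WSS raises the likelihood.

```python
import numpy as np


def kmeans_hard_ml(X, k, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns means, labels, and the WSS per iteration."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random (an assumption).
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    wss_history = []
    for _ in range(n_iter):
        # Assignment step: hard-classify each point to its nearest mean.
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each cluster mean is the ML estimate given the labels.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                means[j] = members.mean(axis=0)
        wss = ((X - means[labels]) ** 2).sum()
        wss_history.append(wss)
        if len(wss_history) > 1 and np.isclose(wss_history[-2], wss):
            break
    return means, labels, np.array(wss_history)


def profile_loglik(wss, n, d):
    """Classification log-likelihood with the common variance profiled out."""
    sigma2 = wss / (n * d)  # ML estimate of the common variance
    return -0.5 * n * d * (np.log(2.0 * np.pi * sigma2) + 1.0)


# Simulated two-cluster data (hypothetical example values).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(5.0, 1.0, size=(100, 2)),
])
means, labels, wss = kmeans_hard_ml(X, k=2)
loglik = profile_loglik(wss, *X.shape)
```

Here `wss` never increases from one iteration to the next, and `loglik` never decreases. Swapping the squared Euclidean distance for the absolute (L1) distance and the mean for the coordinate-wise median would give a K-medians variant, which corresponds in the same way to a Laplace error model.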
Pages: 153-166
Number of pages: 14
Related papers
42 in total
[1] [Anonymous], 2001, P 18 INT C MACH LEAR
[2] [Anonymous], LECT NOTES COMPUTER
[3] [Anonymous], 1988, Algorithms for Clustering Data
[4] [Anonymous], 2003, Introduction to Nessus
[5] Banerjee, Amit; Dave, Rajesh N. Robust clustering [J]. WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY, 2012, 2(01): 29-59
[6] BANFIELD, JD; RAFTERY, AE. MODEL-BASED GAUSSIAN AND NON-GAUSSIAN CLUSTERING [J]. BIOMETRICS, 1993, 49(03): 803-821
[7] Basu S., 2002, P FASEB SUMM RES M A, P9
[8] BOCK, HH. ON SOME SIGNIFICANCE TESTS IN CLUSTER-ANALYSIS [J]. JOURNAL OF CLASSIFICATION, 1985, 2(01): 77-108
[9] Bradley PS, 1997, ADV NEUR IN, V9, P368
[10] BRYANT, P; WILLIAMSON, JA. ASYMPTOTIC-BEHAVIOR OF CLASSIFICATION MAXIMUM LIKELIHOOD ESTIMATES [J]. BIOMETRIKA, 1978, 65(02): 273-281