What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm

Cited by: 128
Authors
Raykov, Yordan P. [1]
Boukouvalas, Alexis [2]
Baig, Fahd [3]
Little, Max A. [1, 4]
Affiliations
[1] Aston Univ, Sch Math, Birmingham, W Midlands, England
[2] Univ Manchester, Mol Sci, Manchester, Lancs, England
[3] Univ Oxford, Nuffield Dept Clin Neurosci, Oxford, England
[4] MIT, Media Lab, Cambridge, MA 02139 USA
Funding
US National Institutes of Health;
Keywords
PARKINSONS-DISEASE; HETEROGENEITY; INFERENCE; SUBTYPES; MIXTURE;
DOI
10.1371/journal.pone.0162259
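Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes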
07 ; 0710 ; 09 ;
Abstract
The K-means algorithm is one of the most popular clustering algorithms in current use, as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm, which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. It can also efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means, with MAP-DP convergence typically achieved on the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.
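To make the restriction mentioned in the abstract concrete, the following is a minimal sketch of the standard K-means (Lloyd) iteration in NumPy. It is not the MAP-DP algorithm and is not taken from the paper; the function name `kmeans`, its parameters, and the synthetic two-blob data are assumptions chosen for illustration. The point to notice is that K is an input the caller must supply, and that assignment uses plain Euclidean distance on continuous data.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Illustrative Lloyd iteration; K must be chosen by the caller."""
    rng = np.random.default_rng(seed)
    # Initialise centroids at K distinct data points (a common simple choice).
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated spherical blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, K=2)
```

Calling `kmeans(X, K=3)` on the same data would still return a three-way partition, since nothing in the procedure questions the value of K it is given. That is the fixed-K behaviour the abstract contrasts with estimating the number of clusters from the data.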
Pages: 28