A comparative study of the K-means algorithm and the normal mixture model for clustering:: Univariate case

被引:22
作者
Qiu, Dingxi [1 ]
Tamhane, Ajit C. [1 ]
机构
[1] Northwestern Univ, Dept Ind Engn & Management Sci, Evanston, IL 60208 USA
关键词
Bayes rule; clustering; data mining; EM algorithm; K-means algorithm; misclassification rate; mixture model; prior and posterior probabilities;
D O I
10.1016/j.jspi.2007.03.045
中图分类号
O21 [概率论与数理统计]; C8 [统计学];
学科分类号
020208 ; 070103 ; 0714 ;
摘要
This paper gives a comparative study of the K-means algorithm and the mixture model (MM) method for clustering normal data. The EM algorithm is used to compute the maximum likelihood estimators (MLEs) of the parameters of the MM model. These parameters include mixing proportions, which may be thought of as the prior probabilities of different clusters: the maximum posterior (Bayes) rule is used for clustering. Hence, asymptotically the MM method approaches the Bayes rule for known parameters, which is optimal in terms of minimizing the expected misclassification rate (EMCR). The paper gives a thorough analytic comparison of the two methods for the univariate case Under both homoscedasticity and heteroscedasticity. Simulation results are given to compare the two methods for a range of sample sizes. The comparison, which is limited to two clusters, shows that the MM method has substantially lower EMCR particularly when the mixing proportions are unbalanced. The two methods have asymptotically the same EMCR under homoscedasticity (resp., heteroscedasticity) when the mixing proportions of the two clusters are equal (resp., unequal), but for small samples the MM method sometimes performs slightly worse because of the errors in estimating unknown parameters. (C) 2007 Elsevier B.V. All rights reserved.
引用
收藏
页码:3722 / 3740
页数:19
相关论文
共 7 条
  • [1] Anderson TW., 1958, An introduction to multivariate statistical analysis
  • [2] MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM
    DEMPSTER, AP
    LAIRD, NM
    RUBIN, DB
    [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01): : 1 - 38
  • [3] Everitt BS, 1993, CLUSTER ANAL
  • [4] Hastie T., 2002, ELEMENTS STAT LEARNI
  • [5] Johnson NL., 1970, Continuous univariate distributions-1: distributions in statistics
  • [6] MacQueen J., 1967, Proc fifth Berkeley Symp Math Stat Probab, V1, P281
  • [7] McLachlan G. J., 1997, EM ALGORITHM EXTENSI