In search of deterministic methods for initializing K-means and Gaussian mixture clustering

Cited: 84
Authors
Su, Ting [1]
Dy, Jennifer G. [1]
Affiliation
[1] Northeastern Univ, Dept Elect & Comp Engn, Boston, MA 02115 USA
Keywords
K-means; Gaussian mixture; initialization; PCA; clustering
DOI
10.3233/IDA-2007-11402
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
The performance of K-means and Gaussian mixture model (GMM) clustering depends on the initial guess of partitions. Typically, clustering algorithms are initialized by random starts. In our search for a deterministic method, we found two promising approaches: principal component analysis (PCA) partitioning and Var-Part (Variance Partitioning). K-means clustering tries to minimize the sum-squared-error criterion. The eigenvector with the largest eigenvalue is the direction that contributes most to the sum-squared error; hence, a good candidate direction along which to project a cluster for splitting is the cluster's largest eigenvector, which is the basis for PCA partitioning. Similarly, GMM clustering maximizes the likelihood; minimizing the determinant of each cluster's covariance matrix helps to increase the likelihood. The largest eigenvector contributes most to the determinant and is thus again a good candidate direction for splitting. However, PCA is computationally expensive. We therefore introduce Var-Part, which is computationally less complex (with complexity equal to one K-means iteration) and approximates PCA partitioning under a diagonal-covariance assumption. Experiments reveal that Var-Part performs similarly to PCA partitioning, sometimes better, and leads K-means (and GMM) to sum-squared-error (and likelihood) values close to the optima obtained by several random-start runs, often at faster convergence rates.
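The Var-Part idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' reference implementation: it recursively bisects the cluster with the largest sum-squared error at its mean along its highest-variance coordinate (the diagonal-covariance stand-in for the leading principal component), and the resulting cluster means serve as a deterministic K-means initialization. The function name `var_part` is my own.

```python
import numpy as np

def var_part(X, k):
    """Sketch of Variance Partitioning (Var-Part): deterministically derive
    k initial centroids by recursively bisecting the cluster with the
    largest within-cluster sum-squared error along the coordinate of
    largest variance, splitting at the cluster mean.

    Degenerate splits (e.g. all points identical along the chosen
    coordinate) are not handled in this sketch.
    """
    clusters = [X]
    while len(clusters) < k:
        # Pick the cluster contributing the most sum-squared error.
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        c = clusters.pop(int(np.argmax(sse)))
        # Highest-variance coordinate: a diagonal-covariance approximation
        # of the cluster's leading eigenvector (PCA partitioning would
        # project onto the eigenvector instead).
        j = int(np.argmax(c.var(axis=0)))
        cut = c[:, j].mean()
        clusters.append(c[c[:, j] <= cut])
        clusters.append(c[c[:, j] > cut])
    # Cluster means become the deterministic K-means (or GMM) initialization.
    return np.array([c.mean(axis=0) for c in clusters])

# Small demo on two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
centroids = var_part(X, 2)
```

Because every step is deterministic (no random restarts), repeated runs on the same data yield the same initialization, which is the point of the method.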
Pages: 319-338 (20 pages)
Related Papers (50 records)
  • [31] K-Means Divide and Conquer Clustering
    Khalilian, Madjid
    Boroujeni, Farsad Zamani
    Mustapha, Norwati
    Sulaiman, Md. Nasir
    2009 INTERNATIONAL CONFERENCE ON COMPUTER AND AUTOMATION ENGINEERING, PROCEEDINGS, 2009, : 306 - 309
  • [32] The validity of pyramid K-means clustering
    Tamir, Dan E.
    Park, Chi-Yeon
    Yoo, Wook-Sung
    MATHEMATICS OF DATA/IMAGE PATTERN RECOGNITION, COMPRESSION, CODING, AND ENCRYPTION X, WITH APPLICATIONS, 2007, 6700
  • [33] Adaptive K-Means clustering algorithm
    Chen, Hailin
    Wu, Xiuqing
    Hu, Junhua
    MIPPR 2007: PATTERN RECOGNITION AND COMPUTER VISION, 2007, 6788
  • [34] Improved Algorithm for the k-means Clustering
    Zhang, Sheng
    Wang, Shouqiang
    PROCEEDINGS OF THE 10TH WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION (WCICA 2012), 2012, : 4717 - 4720
  • [35] An improved K-means clustering algorithm
    Huang, Xiuchang
    Su, Wei
JOURNAL OF NETWORKS, 2014, 9 (01) : 161 - 167
  • [36] Locally Private k-Means Clustering
    Stemmer, Uri
    JOURNAL OF MACHINE LEARNING RESEARCH, 2021, 22
  • [37] An Enhancement of K-means Clustering Algorithm
    Gu, Jirong
    Zhou, Jieming
    Chen, Xianwei
    2009 INTERNATIONAL CONFERENCE ON BUSINESS INTELLIGENCE AND FINANCIAL ENGINEERING, PROCEEDINGS, 2009, : 237 - 240
  • [38] Optimization of K-means clustering method using hybrid capuchin search algorithm
    Qtaish, Amjad
    Braik, Malik
    Albashish, Dheeb
    Alshammari, Mohammad T.
    Alreshidi, Abdulrahman
    Alreshidi, Eissa Jaber
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (02) : 1728 - 1787
  • [39] Optimization of Clustering process for WSN with Hybrid Harmony search and K-means algorithm
    Raval, Dharmanshu
    Raval, Gaurang
    Valiveti, Sharada
    2016 5TH INTERNATIONAL CONFERENCE ON RECENT TRENDS IN INFORMATION TECHNOLOGY (ICRTIT), 2016,