An Efficient Split-Merge Re-Start for the K-Means Algorithm

Cited by: 26
Authors
Capo, Marco [1]
Perez, Aritz [1]
Lozano, Jose A. [1,2]
Affiliations
[1] Basque Ctr Appl Math, Bilbao 48009, Spain
[2] Univ Basque Country UPV EHU, Dept Comp Sci & Artificial Intelligence, Intelligent Syst Grp, San Sebastian 20018, Spain
Keywords
K-means; K-means++; Hartigan K-means; clustering; MEANS CLUSTERING-ALGORITHM; QUANTIZATION; INITIALIZATION;
DOI
10.1109/TKDE.2020.3002926
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The K-means algorithm is one of the most popular clustering methods. However, it is well known that its performance, in terms of both the quality of the obtained solution and the computational load, depends strongly on its initialization phase. For this reason, different initialization techniques have been developed over the years to enable fast convergence to competitive solutions. In practice, it is common to re-start the K-means algorithm several times via one of these techniques and keep the solution with the lowest error. Unfortunately, such a choice is still likely to be a poor approximation of the optimal set of centroids. In this article, we introduce a cheap Split-Merge step that can be used to re-start the K-means algorithm after it reaches a fixed point. Under certain settings, one can show that this approach reduces the error of the given fixed point without requiring any further iteration of the K-means algorithm. Moreover, experimental results show that this strategy generates approximations whose error is hard to reach for different multi-start methods, such as multi-start Forgy K-means, K-means++ and Hartigan K-means, while also computing fewer distances than these algorithms.
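The split-merge re-start idea described in the abstract can be sketched in a few lines of NumPy. The selection rules used below (merge the closest pair of centroids into their midpoint, split the cluster with the largest within-cluster error) are simplified stand-ins for the criteria derived in the paper, and the names `sse` and `split_merge_step` are hypothetical; treat this as an illustrative sketch, not the authors' algorithm.

```python
import numpy as np

def sse(X, C):
    """Within-cluster sum of squared errors and point-to-cluster labels."""
    labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
    return float(((X - C[labels]) ** 2).sum()), labels

def split_merge_step(X, C):
    """One illustrative split-merge re-start on a K-means fixed point.

    Merges the closest pair of centroids and splits the highest-error
    cluster into two halves, so the number of clusters K is unchanged.
    Requires K >= 3.  (Simplified selection rules, for illustration.)
    """
    _, labels = sse(X, C)
    k = len(C)

    # Merge candidate: the closest pair of centroids (i, j).
    d = np.linalg.norm(C[:, None] - C[None], axis=2)
    np.fill_diagonal(d, np.inf)
    i, j = np.unravel_index(np.argmin(d), d.shape)

    # Split candidate: the highest-error cluster not involved in the merge.
    err = np.array([((X[labels == c] - C[c]) ** 2).sum() for c in range(k)])
    s = next(c for c in np.argsort(err)[::-1] if c not in (i, j))

    # Perturb the split cluster's centroid along its per-feature spread.
    offset = X[labels == s].std(axis=0) / 2 + 1e-12

    C_new = C.copy()
    C_new[i] = (C[i] + C[j]) / 2   # merged centroid replaces slot i
    C_new[j] = C[s] + offset       # slot j reused for one split half
    C_new[s] = C[s] - offset       # the other split half
    return C_new
```

In a multi-start loop, one would accept `C_new` only if it lowers the error (possibly after a few further Lloyd iterations) and otherwise keep the current fixed point, which is what makes the step a cheap re-start rather than a full re-initialization.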
Pages: 1618-1627
Page count: 10
Related references
44 records in total
[11]   An efficient K-means clustering algorithm for tall data [J].
Capo, Marco ;
Perez, Aritz ;
Lozano, Jose A. .
DATA MINING AND KNOWLEDGE DISCOVERY, 2020, 34 (03) :776-811
[12]   An efficient approximation to the K-means clustering for massive data [J].
Capo, Marco ;
Perez, Aritz ;
Lozano, Jose A. .
KNOWLEDGE-BASED SYSTEMS, 2017, 117 :56-69
[13]   A comparative study of efficient initialization methods for the k-means clustering algorithm [J].
Celebi, M. Emre ;
Kingravi, Hassan A. ;
Vela, Patricio A. .
EXPERT SYSTEMS WITH APPLICATIONS, 2013, 40 (01) :200-210
[14]  
Dhillon I. S., 2002, Proceedings 2002 IEEE International Conference on Data Mining. ICDM 2002, P131, DOI 10.1109/ICDM.2002.1183895
[15]
Ding Y. F., 2015, Proceedings of Machine Learning Research, V37, P579
[16]
Elkan C., 2003, Proceedings of the 20th International Conference on Machine Learning, P147
[17]   A new vector quantization clustering algorithm [J].
Equitz, W. H. .
IEEE TRANSACTIONS ON ACOUSTICS SPEECH AND SIGNAL PROCESSING, 1989, 37 (10) :1568-1575
[18]  
Feldman D., 2007, Proceedings of the twenty-third annual symposium on computational geometry, P11
[19]
Forgy E. W., 1965, BIOMETRICS, V21, P768
[20]   How much can k-means be improved by using better initialization and repeats? [J].
Franti, Pasi ;
Sieranoja, Sami .
PATTERN RECOGNITION, 2019, 93 :95-112