An Efficient Split-Merge Re-Start for the K-Means Algorithm

Cited by: 21
Authors
Capo, Marco [1]
Perez, Aritz [1]
Lozano, Jose A. [1,2]
Affiliations
[1] Basque Ctr Appl Math, Bilbao 48009, Spain
[2] Univ Basque Country UPV EHU, Dept Comp Sci & Artificial Intelligence, Intelligent Syst Grp, San Sebastian 20018, Spain
Keywords
K-means; K-means++; Hartigan K-means; clustering; K-means clustering algorithm; quantization; initialization
DOI
10.1109/TKDE.2020.3002926
CLC Classification (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The K-means algorithm is one of the most popular clustering methods. However, it is a well-known fact that its performance, both in terms of the quality of the solution obtained and of the computational load, depends heavily on its initialization phase. For this reason, different initialization techniques have been developed over the years to enable fast convergence to competitive solutions. Accordingly, it is common practice to re-start the K-means algorithm several times via one of these techniques and keep the solution with the lowest error. Unfortunately, even this choice is likely to be a poor approximation of the optimal set of centroids. In this article, we introduce a cheap Split-Merge step that can be used to re-start the K-means algorithm after it reaches a fixed point. Under some settings, one can show that this approach reduces the error of the given fixed point without requiring any further iteration of the K-means algorithm. Moreover, experimental results show that this strategy generates approximations whose error is hard to reach for multi-start methods such as multi-start Forgy K-means, K-means++, and Hartigan K-means, while also computing fewer distances than these algorithms.
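The record does not reproduce the paper's pseudocode, but the abstract's idea can be sketched in code. Below is a minimal, hypothetical Python sketch of a split-merge re-start, assuming an SSE-based split criterion (split the cluster with the largest within-cluster error) and a distance-based merge criterion (merge the two closest centroids); these criteria and all function names are illustrative placeholders, not the authors' actual procedure.

    import numpy as np

    def kmeans_fixed_point(X, centers, max_iter=100):
        # Plain Lloyd iterations: run until the assignment stops
        # changing, i.e., until K-means reaches a fixed point.
        labels = None
        for _ in range(max_iter):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(labels, new_labels):
                break
            labels = new_labels
            for k in range(len(centers)):
                pts = X[labels == k]
                if len(pts) > 0:  # keep the old centroid if a cluster empties
                    centers[k] = pts.mean(axis=0)
        return centers, labels

    def split_merge_restart(X, centers, labels):
        # One cheap re-start: merge the two closest centroids into one
        # and split the highest-error remaining cluster into two, so
        # the number of centroids K stays constant (assumes K >= 3).
        K = len(centers)
        sse = np.array([((X[labels == k] - centers[k]) ** 2).sum()
                        for k in range(K)])
        pair_d = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        np.fill_diagonal(pair_d, np.inf)
        i, j = np.unravel_index(pair_d.argmin(), pair_d.shape)
        rest = [k for k in range(K) if k not in (i, j)]
        worst = max(rest, key=lambda k: sse[k])
        spread = X[labels == worst].std(axis=0) + 1e-12  # split direction
        kept = [centers[k] for k in rest if k != worst]
        return np.vstack(kept + [centers[[i, j]].mean(axis=0),
                                 centers[worst] - spread,
                                 centers[worst] + spread])

    # Usage: alternate K-means runs with split-merge re-starts.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    centers = X[rng.choice(len(X), size=5, replace=False)].copy()  # Forgy init
    for _ in range(3):
        centers, labels = kmeans_fixed_point(X, centers)
        centers = split_merge_restart(X, centers, labels)

Note that, per the abstract, the paper's Split-Merge step provably reduces the fixed point's error under some settings; the simple criteria above carry no such guarantee and only illustrate the overall control flow.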
Pages: 1618-1627
Page count: 10
References
44 in total
[1] Aloise, Daniel; Deshpande, Amit; Hansen, Pierre; Popat, Preyas. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 2009, 75(2): 245-248.
[2] [Anonymous], 2013, Advances in Neural Information Processing Systems. DOI: 10.5555/2999792.2999835.
[3] [Anonymous], 2016, Advances in Neural Information Processing Systems.
[4] Arthur, D., 2007, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, p. 1027.
[5] Bachem, Olivier; Lucic, Mario; Krause, Andreas. Scalable k-means clustering via lightweight coresets. KDD'18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018: 1119-1127.
[6] Bahmani, Bahman; Moseley, Benjamin; Vattani, Andrea; Kumar, Ravi; Vassilvitskii, Sergei. Scalable K-means++. Proceedings of the VLDB Endowment, 2012, 5(7): 622-633.
[7] Ball, G. H., 1965, NOVEL METHOD DATA AN.
[8] Berkhin, P., 2006, Grouping Multidimensional Data: Recent Advances in Clustering, p. 25.
[9] Bottou, L., 1995, Advances in Neural Information Processing Systems 7, p. 585.
[10] Capo, Marco; Perez, Aritz; Lozano, Jose A. An efficient K-means clustering algorithm for tall data. Data Mining and Knowledge Discovery, 2020, 34(3): 776-811.