An Efficient Split-Merge Re-Start for the K-Means Algorithm

Cited by: 21
Authors
Capo, Marco [1]
Perez, Aritz [1]
Lozano, Jose A. [1,2]
Affiliations
[1] Basque Ctr Appl Math, Bilbao 48009, Spain
[2] Univ Basque Country UPV EHU, Dept Comp Sci & Artificial Intelligence, Intelligent Syst Grp, San Sebastian 20018, Spain
Keywords
K-means; K-means++; Hartigan K-means; clustering; K-means clustering algorithm; quantization; initialization
DOI
10.1109/TKDE.2020.3002926
CLC Classification (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The K-means algorithm is one of the most popular clustering methods. However, it is a well-known fact that its performance, both in terms of the quality of the solution obtained and of the computational load, depends heavily on its initialization phase. For this reason, different initialization techniques have been developed over the years to enable fast convergence to competitive solutions. Accordingly, it is common practice to re-start the K-means algorithm several times via one of these techniques and keep the solution with the lowest error. Unfortunately, even this choice is likely to be a poor approximation of the optimal set of centroids. In this article, we introduce a cheap Split-Merge step that can be used to re-start the K-means algorithm after it reaches a fixed point. Under some settings, one can show that this approach reduces the error of the given fixed point without requiring any further iteration of the K-means algorithm. Moreover, experimental results show that this strategy generates approximations whose error is hard to reach for multi-start methods such as multi-start Forgy K-means, K-means++, and Hartigan K-means, while also computing fewer distances than these algorithms.
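The record does not reproduce the paper's pseudocode, but the abstract's idea can be sketched in code. Below is a minimal, hypothetical Python sketch of a split-merge re-start, assuming an SSE-based split criterion (split the cluster with the largest within-cluster error) and a distance-based merge criterion (merge the two closest centroids); these criteria and all function names are illustrative placeholders, not the authors' actual procedure.

    import numpy as np

    def kmeans_fixed_point(X, centers, max_iter=100):
        # Plain Lloyd iterations: run until the assignment stops
        # changing, i.e., until K-means reaches a fixed point.
        labels = None
        for _ in range(max_iter):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(labels, new_labels):
                break
            labels = new_labels
            for k in range(len(centers)):
                pts = X[labels == k]
                if len(pts) > 0:  # keep the old centroid if a cluster empties
                    centers[k] = pts.mean(axis=0)
        return centers, labels

    def split_merge_restart(X, centers, labels):
        # One cheap re-start: merge the two closest centroids into one
        # and split the highest-error remaining cluster into two, so
        # the number of centroids K stays constant (assumes K >= 3).
        K = len(centers)
        sse = np.array([((X[labels == k] - centers[k]) ** 2).sum()
                        for k in range(K)])
        pair_d = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        np.fill_diagonal(pair_d, np.inf)
        i, j = np.unravel_index(pair_d.argmin(), pair_d.shape)
        rest = [k for k in range(K) if k not in (i, j)]
        worst = max(rest, key=lambda k: sse[k])
        spread = X[labels == worst].std(axis=0) + 1e-12  # split direction
        kept = [centers[k] for k in rest if k != worst]
        return np.vstack(kept + [centers[[i, j]].mean(axis=0),
                                 centers[worst] - spread,
                                 centers[worst] + spread])

    # Usage: alternate K-means runs with split-merge re-starts.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    centers = X[rng.choice(len(X), size=5, replace=False)].copy()  # Forgy init
    for _ in range(3):
        centers, labels = kmeans_fixed_point(X, centers)
        centers = split_merge_restart(X, centers, labels)

Note that, per the abstract, the paper's Split-Merge step provably reduces the fixed point's error under some settings; the simple criteria above carry no such guarantee and only illustrate the overall control flow.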
Pages: 1618-1627
Page count: 10
References
44 in total
[1] Aloise, Daniel; Deshpande, Amit; Hansen, Pierre; Popat, Preyas. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 2009, 75(2): 245-248.
[2] [Anonymous], 2013, Advances in Neural Information Processing Systems. DOI: 10.5555/2999792.2999835.
[3] [Anonymous], 2016, Advances in Neural Information Processing Systems.
[4] Arthur, D., 2007, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, p. 1027.
[5] Bachem, Olivier; Lucic, Mario; Krause, Andreas. Scalable k-means clustering via lightweight coresets. KDD'18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018: 1119-1127.
[6] Bahmani, Bahman; Moseley, Benjamin; Vattani, Andrea; Kumar, Ravi; Vassilvitskii, Sergei. Scalable K-means++. Proceedings of the VLDB Endowment, 2012, 5(7): 622-633.
[7] Ball, G. H., 1965, NOVEL METHOD DATA AN.
[8] Berkhin, P., 2006, Grouping Multidimensional Data: Recent Advances in Clustering, p. 25.
[9] Bottou, L., 1995, Advances in Neural Information Processing Systems 7, p. 585.
[10] Capo, Marco; Perez, Aritz; Lozano, Jose A. An efficient K-means clustering algorithm for tall data. Data Mining and Knowledge Discovery, 2020, 34(3): 776-811.