A local search approximation algorithm for k-means clustering

被引:350
作者
Kanungo, T
Mount, DM [1 ]
Netanyahu, NS
Piatko, CD
Silverman, R
Wu, AY
机构
[1] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[2] IBM Corp, Almaden Res Ctr, San Jose, CA 95120 USA
[3] Bar Ilan Univ, Dept Math & Comp Sci, IL-52900 Ramat Gan, Israel
[4] Univ Maryland, Ctr Automat Res, College Pk, MD 20742 USA
[5] Johns Hopkins Univ, Appl Phys Lab, Laurel, MD USA
[6] American Univ, Dept Comp Sci, Washington, DC 20016 USA
来源
COMPUTATIONAL GEOMETRY-THEORY AND APPLICATIONS | 2004年 / 28卷 / 2-3期
关键词
clustering; k-means; approximation algorithms; local search; computational geometry;
D O I
10.1016/j.comgeo.2004.03.003
中图分类号
O29 [应用数学];
学科分类号
070104 ;
摘要
In k-means clustering we are given a set of n data points in d-dimensional space R-d and an integer k, and the problem is to determine a set of k points in R-d, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9 + epsilon)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9 - epsilon) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd's algorithm, this heuristic performs quite well in practice. (C) 2004 Elsevier B.V. All rights reserved.
引用
收藏
页码:89 / 112
页数:24
相关论文
共 42 条
[1]  
[Anonymous], 1999, 40 ANN S FDN COMP SC
[2]  
[Anonymous], 1998, EX APPR ALG CLUST SO
[3]  
[Anonymous], ELECT ENG COMPUTER S
[4]  
ARARA S, 1998, P 13 ANN ACM S THEOR, P106
[5]   Accounting for boundary effects in nearest-neighbor searching [J].
Arya, S ;
Mount, DM ;
Narayan, O .
DISCRETE & COMPUTATIONAL GEOMETRY, 1996, 16 (02) :155-176
[6]  
Arya V., 2001, 33 ACM S THEORY COMP, P21
[7]  
BALL GH, 1964, INT C MICR CIRC THEO
[8]   Clustering using simulated annealing with probabilistic redistribution [J].
Bandyopadhyay, S ;
Maulik, U ;
Pakhira, MK .
INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2001, 15 (02) :269-285
[9]  
Bayardo R.J., 1999, P 5 ACM SIGKDD INT C
[10]   GEOMETRIC CLUSTERINGS [J].
CAPOYLEAS, V ;
ROTE, G ;
WOEGINGER, G .
JOURNAL OF ALGORITHMS, 1991, 12 (02) :341-356