Clustering for Metric and Nonmetric Distance Measures

被引:57
作者
Ackermann, Marcel R. [1 ]
Bloemer, Johannes [1 ]
Sohler, Christian [2 ]
机构
[1] Univ Paderborn, Dept Comp Sci, D-33095 Paderborn, Germany
[2] Tech Univ, Dept Comp Sci, D-44221 Dortmund, Germany
关键词
Approximation algorithm; Bregman divergences; Itakura-Saito divergence; k-means clustering; k-median clustering; Kullback-Leibler divergence; Mahalanobis distance; random sampling;
D O I
10.1145/1824777.1824779
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
We study a generalization of the k-median problem with respect to an arbitrary dissimilarity measure D. Given a finite set P of size n, our goal is to find a set C of size k such that the sum of errors D(P, C) = Sigma(p is an element of P) min(c is an element of C) {D(p, c)} is minimized. The main result in this article can be stated as follows: There exists a (1 + epsilon)-approximation algorithm for the k-median problem with respect to D, if the I-median problem can be approximated within a factor of (1 + epsilon) by taking a random sample of constant size and solving the I-median problem on the sample exactly. This algorithm requires time n2(O(mklog(mk/epsilon))), where m is a constant that depends only on epsilon and D. Using this characterization, we obtain the first linear time (1 + epsilon)-approximation algorithms for the k-median problem in an arbitrary metric space with bounded doubling dimension, for the Kullback-Leibler divergence (relative entropy), for the Itakura-Saito divergence, for Mahalanobis distances, and for some special cases of Bregman divergences. Moreover, we obtain previously known results for the Euclidean k-median problem and the Euclidean k-means problem in a simplified manner. Our results are based on a new analysis of an algorithm of Kumar et al. [2004].
引用
收藏
页数:26
相关论文
共 39 条
[11]  
Banerjee A, 2005, J MACH LEARN RES, V6, P1705
[12]  
Bregman L. M., 1967, USSR Comput Math Math Phys, V7, P200, DOI [10.1016/0041-5553(67)90040-7, DOI 10.1016/0041-5553(67)90040-7]
[13]   SPEECH CODING BASED UPON VECTOR QUANTIZATION [J].
BUZO, A ;
GRAY, AH ;
GRAY, RM ;
MARKEL, JD .
IEEE TRANSACTIONS ON ACOUSTICS SPEECH AND SIGNAL PROCESSING, 1980, 28 (05) :562-574
[14]  
Censor Y, 1997, NUMERICAL MATH SCI C
[15]  
Chaudhuri K., 2008, P COLT 2008, P391
[16]   On k-Median Clustering in High Dimensions [J].
Chen, Ke .
PROCEEDINGS OF THE SEVENTHEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2006, :1177-1185
[17]   ON CORESETS FOR k-MEDIAN AND k-MEANS CLUSTERING IN METRIC AND EUCLIDEAN SPACES AND THEIR APPLICATIONS [J].
Chen, Ke .
SIAM JOURNAL ON COMPUTING, 2009, 39 (03) :923-947
[18]  
Dhillon I. S., 2003, Journal of Machine Learning Research, V3, P1265, DOI 10.1162/153244303322753661
[19]  
Feldman D., 2007, P 23 ANN S COMP GEOM, P11, DOI DOI 10.1145/1247069.1247072
[20]   Bounded geometries, fractals, and low-distortion embeddings [J].
Gupta, A ;
Krauthgamer, R ;
Lee, JR .
44TH ANNUAL IEEE SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 2003, :534-543