Faster Algorithms for the Constrained k-means Problem

被引:27
作者
Bhattacharya, Anup [1 ]
Jaiswal, Ragesh [1 ]
Kumar, Amit [1 ]
机构
[1] Indian Inst Technol Delhi, Dept Comp Sci & Engn, New Delhi, India
关键词
Constrained k-means clustering; D-2; sampling; CLUSTERING PROBLEMS; PTAS;
D O I
10.1007/s00224-017-9820-7
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
The classical center based clustering problems such as k-means/median/center assume that the optimal clusters satisfy the locality property that the points in the same cluster are close to each other. A number of clustering problems arise in machine learning where the optimal clusters do not follow such a locality property. For instance, consider the r -gather clustering problem where there is an additional constraint that each of the clusters should have at least r points or the capacitated clustering problem where there is an upper bound on the cluster sizes. Consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters O-1, ... , O-k are an arbitrary partition of the dataset and the goal is to output k-centers c(1), ... , c (k) such that the objective function Sigma(k)(i=1) Sigma(x is an element of Oi) parallel to x - c(i)parallel to(2) is minimized. It is not difficult to argue that any algorithm (without knowing the optimal clusters) that outputs a single set of k centers, will not behave well as far as optimizing the above objective function is concerned. However, this does not rule out the existence of algorithms that output a list of such k centers such that at least one of these k centers behaves well. Given an error parameter epsilon > 0, let l denote the size of the smallest list of k-centers such that at least one of the k-centers gives a (1 + epsilon) approximation w.r.t. the objective function above. In this paper, we show an upper bound on l by giving a randomized algorithm that outputs a list of 2((O) over tilde (k/epsilon)) k-centers. We also give a closely matching lower bound of 2((Omega) over tilde (k/root epsilon)) . Moreover, our algorithm runs in time O(nd . 2((O) over tilde (k/epsilon))) . This is a significant improvement over the previous result of Ding and Xu (2015) who gave an algorithm with running time O(nd . (log n)(k) . 2(poly(k/epsilon))) and output a list of size O((log n)(k) . 2(poly(k/)epsilon)). Our techniques generalize for the k-median problem and for many other settings where non-Euclidean distance measures are involved.
引用
收藏
页码:93 / 115
页数:23
相关论文
共 12 条
[1]   Clustering for Metric and Nonmetric Distance Measures [J].
Ackermann, Marcel R. ;
Bloemer, Johannes ;
Sohler, Christian .
ACM TRANSACTIONS ON ALGORITHMS, 2010, 6 (04)
[2]  
[Anonymous], 2003, P 35 ANN ACM S THEOR, DOI [10.1145/780542.780550, DOI 10.1145/780542.780550]
[3]  
[Anonymous], 2004, P 36 ANN ACM S THEOR
[4]  
Badoiu M., 2002, P THIRYFOURTH ANN AC, P250, DOI DOI 10.1145/509907.509947
[5]   On k-Median Clustering in High Dimensions [J].
Chen, Ke .
PROCEEDINGS OF THE SEVENTHEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2006, :1177-1185
[6]  
Ding H., 2015, Proc. 26th ACM-SIAM Symposium on Discrete Algorithms SODA'15, P1471
[7]  
Feldman Dan, 2007, Proc. 23rd ACM Symp. on Computational Geometry (SoCG), P11
[8]  
Inaba M., 1994, Proceedings of the Tenth Annual Symposium on Computational Geometry, P332, DOI 10.1145/177424.178042
[9]   Improved analysis of D2-sampling based PTAS for k-means and other clustering problems [J].
Jaiswal, Ragesh ;
Kumar, Mehul ;
Yadav, Pulkit .
INFORMATION PROCESSING LETTERS, 2015, 115 (02) :100-103
[10]   A Simple D 2-Sampling Based PTAS for k-Means and Other Clustering Problems [J].
Jaiswal, Ragesh ;
Kumar, Amit ;
Sen, Sandeep .
ALGORITHMICA, 2014, 70 (01) :22-46