Faster Algorithms for the Constrained k-means Problem

被引：27

作者：

Bhattacharya, Anup ^{[1
]}

Jaiswal, Ragesh ^{[1
]}

Kumar, Amit ^{[1
]}

机构：

[1] Indian Inst Technol Delhi, Dept Comp Sci & Engn, New Delhi, India

来源：

THEORY OF COMPUTING SYSTEMS | 2018年 / 62卷 / 01期

关键词：

Constrained k-means clustering; D-2; sampling; CLUSTERING PROBLEMS; PTAS;

D O I：

10.1007/s00224-017-9820-7

中图分类号：

TP301 [理论、方法];

学科分类号：

081202 ;

摘要：

The classical center based clustering problems such as k-means/median/center assume that the optimal clusters satisfy the locality property that the points in the same cluster are close to each other. A number of clustering problems arise in machine learning where the optimal clusters do not follow such a locality property. For instance, consider the r -gather clustering problem where there is an additional constraint that each of the clusters should have at least r points or the capacitated clustering problem where there is an upper bound on the cluster sizes. Consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters O-1, ... , O-k are an arbitrary partition of the dataset and the goal is to output k-centers c(1), ... , c (k) such that the objective function Sigma(k)(i=1) Sigma(x is an element of Oi) parallel to x - c(i)parallel to(2) is minimized. It is not difficult to argue that any algorithm (without knowing the optimal clusters) that outputs a single set of k centers, will not behave well as far as optimizing the above objective function is concerned. However, this does not rule out the existence of algorithms that output a list of such k centers such that at least one of these k centers behaves well. Given an error parameter epsilon > 0, let l denote the size of the smallest list of k-centers such that at least one of the k-centers gives a (1 + epsilon) approximation w.r.t. the objective function above. In this paper, we show an upper bound on l by giving a randomized algorithm that outputs a list of 2((O) over tilde (k/epsilon)) k-centers. We also give a closely matching lower bound of 2((Omega) over tilde (k/root epsilon)) . Moreover, our algorithm runs in time O(nd . 2((O) over tilde (k/epsilon))) . This is a significant improvement over the previous result of Ding and Xu (2015) who gave an algorithm with running time O(nd . (log n)(k) . 2(poly(k/epsilon))) and output a list of size O((log n)(k) . 2(poly(k/)epsilon)). Our techniques generalize for the k-median problem and for many other settings where non-Euclidean distance measures are involved.

引用

页码：93 / 115

页数：23

共 12 条

[1] Clustering for Metric and Nonmetric Distance Measures [J].

Ackermann, Marcel R. ;

Bloemer, Johannes ;

Sohler, Christian .

ACM TRANSACTIONS ON ALGORITHMS, 2010, 6 (04)

[2]

[Anonymous], 2003, P 35 ANN ACM S THEOR, DOI [10.1145/780542.780550, DOI 10.1145/780542.780550]

[3]

[Anonymous], 2004, P 36 ANN ACM S THEOR

[4]

Badoiu M., 2002, P THIRYFOURTH ANN AC, P250, DOI DOI 10.1145/509907.509947

[5] On k-Median Clustering in High Dimensions [J].

Chen, Ke .

PROCEEDINGS OF THE SEVENTHEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2006, :1177-1185

[6]

Ding H., 2015, Proc. 26th ACM-SIAM Symposium on Discrete Algorithms SODA'15, P1471

[7]

Feldman Dan, 2007, Proc. 23rd ACM Symp. on Computational Geometry (SoCG), P11

[8]

Inaba M., 1994, Proceedings of the Tenth Annual Symposium on Computational Geometry, P332, DOI 10.1145/177424.178042

[9] Improved analysis of D2-sampling based PTAS for k-means and other clustering problems [J].

Jaiswal, Ragesh ;

Kumar, Mehul ;

Yadav, Pulkit .

INFORMATION PROCESSING LETTERS, 2015, 115 (02) :100-103

[10] A Simple D 2-Sampling Based PTAS for k-Means and Other Clustering Problems [J].

Jaiswal, Ragesh ;

Kumar, Amit ;

Sen, Sandeep .

ALGORITHMICA, 2014, 70 (01) :22-46

← 1 2 →