Clustering under Approximation Stability

被引:34
作者
Balcan, Maria-Florina [1 ]
Blum, Avrim [2 ]
Gupta, Anupam [2 ]
机构
[1] Georgia Inst Technol, Sch Comp Sci, Atlanta, GA 30332 USA
[2] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA
基金
美国国家科学基金会;
关键词
Algorithms; Theory; Clustering; Approximation Algorithms; k-Median; k-Means; Min-Sum; Clustering Accuracy; FACILITY LOCATION; ALGORITHMS;
D O I
10.1145/2450142.2450144
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
A common approach to clustering data is to view data objects as points in a metric space, and then to optimize a natural distance-based objective such as the k-median, k-means, or min-sum score. For applications such as clustering proteins by function or clustering images by subject, the implicit hope in taking this approach is that the optimal solution for the chosen objective will closely match the desired "target" clustering (e. g., a correct clustering of proteins by function or of images by who is in them). However, most distance-based objectives, including those mentioned here, are NP-hard to optimize. So, this assumption by itself is not sufficient, assuming P not equal NP, to achieve clusterings of low-error via polynomial time algorithms. In this article, we show that we can bypass this barrier if we slightly extend this assumption to ask that for some small constant c, not only the optimal solution, but also all c-approximations to the optimal solution, differ from the target on at most some epsilon fraction of points-we call this (c, epsilon)-approximation-stability. We show that under this condition, it is possible to efficiently obtain low-error clusterings even if the property holds only for values c for which the objective is known to be NP-hard to approximate. Specifically, for any constant c > 1, (c, epsilon)-approximation-stability of k-median or k-means objectives can be used to efficiently produce a clustering of error O(epsilon) with respect to the target clustering, as can stability of the min-sum objective if the target clusters are sufficiently large. Thus, we can perform nearly as well in terms of agreement with the target clustering as if we could approximate these objectives to this NP-hard value.
引用
收藏
页数:34
相关论文
共 62 条
[1]  
[Anonymous], ABS09063162 CORR
[2]  
[Anonymous], P 51 ANN IEEE S FDN
[3]  
[Anonymous], P 31 ANN ACM S THEOR
[4]  
[Anonymous], P 40 ANN S FDN COMP
[5]  
[Anonymous], P 16 ANN INT COMP CO
[6]  
[Anonymous], 2009, P 22 ANN C LEARN THE
[7]  
[Anonymous], P NEUR INF PROC SYST
[8]  
[Anonymous], P 24 INT S THEOR ASP
[9]  
[Anonymous], P 18 ANN C LEARN THE
[10]  
[Anonymous], P 25 ACM SIGMOD SIGA