Overlapping correlation clustering

被引:42
作者
Bonchi, Francesco [1 ]
Gionis, Aristides [1 ]
Ukkonen, Antti [1 ]
机构
[1] Yahoo Res, Barcelona 08018, Spain
关键词
Algorithms; Clustering; Overlapping clustering; Correlation clustering; Pregel;
D O I
10.1007/s10115-012-0522-9
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We introduce a new approach for finding overlapping clusters given pairwise similarities of objects. In particular, we relax the problem of correlation clustering by allowing an object to be assigned to more than one cluster. At the core of our approach is an optimization problem in which each data point is mapped to a small set of labels, representing membership in different clusters. The objective is to find a mapping so that the given similarities between objects agree as much as possible with similarities taken over their label sets. The number of labels can vary across objects. To define a similarity between label sets, we consider two measures: (i) a 0-1 function indicating whether the two label sets have non-zero intersection and (ii) the Jaccard coefficient between the two label sets. The algorithm we propose is an iterative local-search method. The definitions of label set similarity give rise to two non-trivial optimization problems, which, for the measures of set-intersection and Jaccard, we solve using a greedy strategy and non-negative least squares, respectively. We also develop a distributed version of our algorithm based on the BSP model and implement it using a Pregel framework. Our algorithm uses as input pairwise similarities of objects and can thus be applied when clustering structured objects for which feature vectors are not available. As a proof of concept, we apply our algorithms on three different and complex application domains: trajectories, amino-acid sequences, and textual documents.
引用
收藏
页码:1 / 32
页数:32
相关论文
共 58 条
[1]  
Ailon N, 2009, AUT LANG PROGR 36 IN
[2]  
Ailon N, 2005, P ACM S THEOR COMP S
[3]   SimClus: an effective algorithm for clustering with a lower bound on similarity [J].
Al Hasan, Mohammad ;
Salem, Saeed ;
Zaki, Mohammed J. .
KNOWLEDGE AND INFORMATION SYSTEMS, 2011, 28 (03) :665-685
[4]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[5]  
[Anonymous], P 4 SIAM INT C DAT M
[6]  
[Anonymous], 2001, P 18 INT C MACH LEAR
[7]  
[Anonymous], P 2010 ACM SIGMOD IN, DOI [DOI 10.1145/1807167.1807184, 10.1145/1807167.1807184]
[8]  
[Anonymous], 1979, Computers and Intractablity: A Guide to the Theory of NP-Completeness
[9]  
[Anonymous], P 10 ACM SIGKDD INT
[10]  
[Anonymous], P SIAM DAT MIN C