Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce

Cited by: 51
Authors
Zhang, Yanfeng [1 ]
Chen, Shimin [2 ]
Yu, Ge [1 ]
Affiliations
[1] Northeastern Univ, Sch Comp Sci & Engn, Shenyang 110819, Peoples R China
[2] Chinese Acad Sci, State Key Lab Comp Architecture, Inst Comp Technol, Beijing 100000, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Density peaks; distributed clustering; MapReduce;
DOI
10.1109/TKDE.2016.2609423
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Density Peaks (DP) is a recently proposed clustering algorithm that has distinctive advantages over existing clustering algorithms. It has already been used in a wide range of applications. However, DP requires computing the distance between every pair of input points, incurring quadratic computation overhead, which is prohibitive for large data sets. In this paper, we study efficient distributed algorithms for DP. We first show that a naive MapReduce solution (Basic-DDP) has high communication and computation overhead. Then, we propose LSH-DDP, an approximate algorithm that exploits Locality Sensitive Hashing for partitioning data, performs local computation, and aggregates local results to approximate the final results. We address several challenges in employing LSH for DP. We leverage the characteristics of DP to deal with the fact that some of the result values cannot be directly approximated in local partitions. We present a formal analysis of LSH-DDP, and show that the approximation quality and the runtime can be controlled by tuning the parameters of LSH-DDP. Experimental results on both a local cluster and EC2 show that LSH-DDP achieves a factor of 1.7-70x speedup over the naive Basic-DDP and a 2x speedup over the state-of-the-art EDDPC approach, while returning comparable cluster results. Compared to the popular K-means clustering, LSH-DDP also has comparable or better performance. Furthermore, LSH-DDP can achieve even higher efficiency under a lower accuracy requirement.
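The core idea in the abstract — hash nearby points into the same partition with LSH, then compute DP's local density per partition instead of over all pairs — can be sketched as follows. This is an illustrative sketch only, not the authors' MapReduce implementation; the function names (`lsh_partition`, `local_density`) and the use of p-stable random-projection LSH for Euclidean distance are assumptions for the example.

```python
import numpy as np

def lsh_partition(points, num_hashes=4, bucket_width=1.0, seed=0):
    """Partition points into buckets via random-projection (p-stable) LSH.

    Nearby points tend to share a bucket key, so per-bucket (local)
    density estimates approximate the global ones. Illustrative sketch;
    not the paper's actual MapReduce partitioner.
    """
    rng = np.random.default_rng(seed)
    dim = points.shape[1]
    # Random projection directions and offsets for L2 LSH.
    projections = rng.normal(size=(num_hashes, dim))
    offsets = rng.uniform(0, bucket_width, size=num_hashes)

    buckets = {}
    for idx, p in enumerate(points):
        key = tuple(np.floor((projections @ p + offsets) / bucket_width)
                    .astype(int))
        buckets.setdefault(key, []).append(idx)
    return buckets

def local_density(points, indices, d_c=0.5):
    """DP-style local density: count neighbors within cutoff d_c,
    computed only inside one partition (O(m^2) per bucket of size m,
    instead of O(n^2) over the whole data set)."""
    sub = points[indices]
    dists = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
    # Subtract 1 so a point does not count itself.
    return {i: int((dists[k] < d_c).sum()) - 1
            for k, i in enumerate(indices)}
```

In a MapReduce setting, the bucket key would serve as the shuffle key: mappers emit `(key, point)`, each reducer runs the local pairwise computation on its bucket, and a final aggregation step approximates the global density and delta values — the quantity the paper notes cannot always be approximated locally.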
Pages: 3218-3230
Page count: 13