A Communication Efficient Parallel DBSCAN Algorithm Based on Parameter Server

被引:7
作者
Hu, Xu [1 ]
Huang, Jun [1 ]
Qiu, Minghui [1 ]
机构
[1] Alibaba Grp, Hangzhou, Zhejiang, Peoples R China
来源
CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT | 2017年
关键词
Density-Based clustering; Parallel DBSCAN; Parameter Server;
D O I
10.1145/3132847.3133112
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Recent benchmark studies show that MPI-based distributed implementations of DBSCAN, e.g., PDSDBSCAN, outperform other implementations such as apache Spark etc. However, the communication cost of MPI DBSCAN increases drastically with the number of processors, which makes it inefficient for large scale problems. In this paper, we propose PS-DBSCAN, a parallel DBSCAN algorithm that combines the disjoint-set data structure and Parameter Server framework, to minimize communication cost. Since data points within the same cluster may be distributed over different workers which result in several disjoint-sets, merging them incurs large communication costs. In our algorithm, we employ a fast global union approach to union the disjoint-sets to alleviate the communication burden. Experiments over the datasets of different scales demonstrate that PS-DBSCAN outperforms the PDSDBSCAN with 2-10 times speedup on communication efficiency. We have released our PS-DBSCAN in an algorithm platform called Platform of AI (PAI)(1) in Alibaba Cloud.
引用
收藏
页码:2107 / 2110
页数:4
相关论文
共 16 条
[1]  
[Anonymous], 2014, OPERATING SYSTEMS DE
[2]  
Brecheisen Stefan, 2006, PAKDD
[3]  
Chen Cen, 2017, WWW 17
[4]   High-performance data mining with skeleton-based structured parallel programming [J].
Coppola, M ;
Vanneschi, M .
PARALLEL COMPUTING, 2002, 28 (05) :793-813
[5]  
Cordova I, 2015, PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS 2015), P531, DOI 10.1109/HPCSim.2015.7237086
[6]  
Ester M., 1996, KDD-96 Proceedings. Second International Conference on Knowledge Discovery and Data Mining, P226
[7]   Research on Parallel DBSCAN Algorithm Design Based on MapReduce [J].
Fu Yan Xiang ;
Zhao Wei Zhong ;
Ma Hui Fang .
ADVANCED MEASUREMENT AND TEST, PTS 1-3, 2011, 301-303 :1133-+
[8]   DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation [J].
Gan, Junhao ;
Tao, Yufei .
SIGMOD'15: PROCEEDINGS OF THE 2015 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2015, :519-530
[9]  
Gotz M., 2015, Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments. MLHPC'15, p2:1
[10]   MR-DBSCAN: An Efficient Parallel Density-based Clustering Algorithm using MapReduce [J].
He, Yaobin ;
Tan, Haoyu ;
Luo, Wuman ;
Mao, Huajian ;
Ma, Di ;
Feng, Shengzhong ;
Fan, Jianping .
2011 IEEE 17TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2011, :473-480