A distributed approximate nearest neighbors algorithm for efficient large scale mean shift clustering

被引:24
作者
Beck, Gael [1 ]
Duong, Tarn [1 ]
Lebbah, Mustapha [1 ]
Azzag, Hanane [1 ]
Cerin, Christophe [1 ]
机构
[1] Univ Paris 13, LIPN CNRS UMR 7030, Comp Sci Lab Paris North, F-93430 Villetaneuse, France
关键词
Clustering; Gradient ascent; Nearest neighbors; Spark; MapReduce; DENSITY; MAPREDUCE;
D O I
10.1016/j.jpdc.2019.07.015
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Mean Shift clustering, as a generalization of the well-known k-means clustering, computes arbitrarily shaped clusters as defined as the basins of attraction to the local modes created by the density gradient ascent paths. Despite its potential for improved clustering accuracy, the Mean Shift approach is a computationally expensive method for unsupervised learning. We introduce two contributions aiming to provide approximate Mean Shift clustering, based on scalable procedures to compute the density gradient ascent and cluster labeling, with a linear time complexity, as opposed to the quadratic time complexity for the exact clustering. Both propositions are based on Locality Sensitive Hashing (LSH) to approximate nearest neighbors. When implemented on a serial system, these approximate methods can be used for moderate sized datasets. To facilitate the analysis of Big Data, a distributed implementation, written for the Spark/Scala ecosystem is proposed. An added benefit is that our proposed approximations of the density gradient ascent, when used as a pre-processing step in other clustering methods, can also improve the clustering accuracy of the latter. We present experimental results illustrating the effect of tuning parameters on cluster labeling accuracy and execution times, as well as the potential to solve concrete problems in Big Clustering. (C) 2019 Elsevier Inc. All rights reserved.
引用
收藏
页码:128 / 139
页数:12
相关论文
共 29 条
[1]  
Aggarwal CC, 2014, CH CRC DATA MIN KNOW, P1
[2]   Scalable K-Means++ [J].
Bahmani, Bahman ;
Moseley, Benjamin ;
Vattani, Andrea ;
Kumar, Ravi ;
Vassilvitskii, Sergei .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2012, 5 (07) :622-633
[3]  
Beck G, 2016, IEEE IJCNN, P3110, DOI 10.1109/IJCNN.2016.7727595
[4]  
BISHOP C. M., 2006, Pattern recognition and machine learning, DOI [DOI 10.1117/1.2819119, 10.1007/978-0-387-45528-0]
[5]   MEAN SHIFT, MODE SEEKING, AND CLUSTERING [J].
CHENG, YZ .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1995, 17 (08) :790-799
[6]  
Cui G.Z.Y., 2011, PROCEDIA ENG, P265
[7]   Histograms of oriented gradients for human detection [J].
Dalal, N ;
Triggs, B .
2005 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2005, :886-893
[8]   Comparing community structure identification -: art. no. P09008 [J].
Danon, L ;
Díaz-Guilera, A ;
Duch, J ;
Arenas, A .
JOURNAL OF STATISTICAL MECHANICS-THEORY AND EXPERIMENT, 2005, :219-228
[9]  
Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDE '04), P137
[10]   Nearest neighbour estimators of density derivatives, with application to mean shift clustering [J].
Duong, Tarn ;
Beck, Gael ;
Azzag, Hanene ;
Lebbah, Mustapha .
PATTERN RECOGNITION LETTERS, 2016, 80 :224-230