A distributed approximate nearest neighbors algorithm for efficient large scale mean shift clustering

被引：24

作者：

Beck, Gael ^{[1
]}

Duong, Tarn ^{[1
]}

Lebbah, Mustapha ^{[1
]}

Azzag, Hanane ^{[1
]}

Cerin, Christophe ^{[1
]}

机构：

[1] Univ Paris 13, LIPN CNRS UMR 7030, Comp Sci Lab Paris North, F-93430 Villetaneuse, France

来源：

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING | 2019年 / 134卷

关键词：

Clustering; Gradient ascent; Nearest neighbors; Spark; MapReduce; DENSITY; MAPREDUCE;

D O I：

10.1016/j.jpdc.2019.07.015

中图分类号：

TP301 [理论、方法];

学科分类号：

081202 ;

摘要：

Mean Shift clustering, as a generalization of the well-known k-means clustering, computes arbitrarily shaped clusters as defined as the basins of attraction to the local modes created by the density gradient ascent paths. Despite its potential for improved clustering accuracy, the Mean Shift approach is a computationally expensive method for unsupervised learning. We introduce two contributions aiming to provide approximate Mean Shift clustering, based on scalable procedures to compute the density gradient ascent and cluster labeling, with a linear time complexity, as opposed to the quadratic time complexity for the exact clustering. Both propositions are based on Locality Sensitive Hashing (LSH) to approximate nearest neighbors. When implemented on a serial system, these approximate methods can be used for moderate sized datasets. To facilitate the analysis of Big Data, a distributed implementation, written for the Spark/Scala ecosystem is proposed. An added benefit is that our proposed approximations of the density gradient ascent, when used as a pre-processing step in other clustering methods, can also improve the clustering accuracy of the latter. We present experimental results illustrating the effect of tuning parameters on cluster labeling accuracy and execution times, as well as the potential to solve concrete problems in Big Clustering. (C) 2019 Elsevier Inc. All rights reserved.

引用

页码：128 / 139

页数：12

共 29 条

[1]

Aggarwal CC, 2014, CH CRC DATA MIN KNOW, P1

[2] Scalable K-Means++ [J].

Bahmani, Bahman ;

Moseley, Benjamin ;

Vattani, Andrea ;

Kumar, Ravi ;

Vassilvitskii, Sergei .

PROCEEDINGS OF THE VLDB ENDOWMENT, 2012, 5 (07) :622-633

[3]

Beck G, 2016, IEEE IJCNN, P3110, DOI 10.1109/IJCNN.2016.7727595

[4]

BISHOP C. M., 2006, Pattern recognition and machine learning, DOI [DOI 10.1117/1.2819119, 10.1007/978-0-387-45528-0]

[5] MEAN SHIFT, MODE SEEKING, AND CLUSTERING [J].

CHENG, YZ .

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1995, 17 (08) :790-799

[6]

Cui G.Z.Y., 2011, PROCEDIA ENG, P265

[7] Histograms of oriented gradients for human detection [J].

Dalal, N ;

Triggs, B .

2005 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2005, :886-893

[8] Comparing community structure identification -: art. no. P09008 [J].

Danon, L ;

Díaz-Guilera, A ;

Duch, J ;

Arenas, A .

JOURNAL OF STATISTICAL MECHANICS-THEORY AND EXPERIMENT, 2005, :219-228

[9]

Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDE '04), P137

[10] Nearest neighbour estimators of density derivatives, with application to mean shift clustering [J].

Duong, Tarn ;

Beck, Gael ;

Azzag, Hanene ;

Lebbah, Mustapha .

PATTERN RECOGNITION LETTERS, 2016, 80 :224-230

← 1 2 3 →