A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data

被引：119

作者：

Chen, Yewang ^{[1
,2
]}

Tang, Shengyu ^{[1
]}

Bouguila, Nizar ^{[2
]}

Wang, Cheng ^{[1
]}

Du, Jixiang ^{[1
]}

Li, HaiLin ^{[1
]}

机构：

[1] Huaqiao Univ, Coll Comp Sci & Technol, Jimei Ave 668, Xiamen, Fujian, Peoples R China

[2] Concordia Inst Informat Syst Engn, 1515 St,Catherine St West,EV007-632, Montreal, PQ H3G 2W1, Canada

来源：

PATTERN RECOGNITION | 2018年 / 83卷

基金：

美国国家科学基金会;

关键词：

DBSCAN; rho-Approximate DBSCAN; NQ-DBSCAN; NEAREST-NEIGHBOR; DENSITY; SEGMENTATION; QUANTIZATION; CHAMELEON; MODEL;

D O I：

10.1016/j.patcog.2018.05.030

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Clustering is an important technique to deal with large scale data which are explosively created in internet. Most data are high-dimensional with a lot of noise, which brings great challenges to retrieval, classification and understanding. No current existing approach is "optimal" for large scale data. For example, DBSCAN requires O(n(2)) time, Fast-DBSCAN only works well in 2 dimensions, and rho-Approximate DBSCAN runs in O(n) expected time which needs dimension D to be a relative small constant for the linear running time to hold. However, we prove theoretically and experimentally that p-Approximate DBSCAN degenerates to an O(n(2)) algorithm in very high dimension such that 2(D) > > n. In this paper, we propose a novel local neighborhood searching technique, and apply it to improve DBSCAN, named as NQ-DBSCAN, such that a large number of unnecessary distance computations can be effectively reduced. Theoretical analysis and experimental results show that NQ-DBSCAN averagely runs in O(n*log(n)) with the help of indexing technique, and the best case is O(n) if proper parameters are used, which makes it suitable for many realtime data. (C) 2018 Elsevier Ltd. All rights reserved.

引用

页码：375 / 387

页数：13

共 63 条

[1]

Ananthanarayanan V, 2015, ADV EDUC TECHNOL INS, P1, DOI 10.4018/978-1-4666-6461-6.ch001

[2] Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions [J].

Andoni, Alexandr ;

Indyk, Piotr .

COMMUNICATIONS OF THE ACM, 2008, 51 (01) :117-122

[3]

[Anonymous], BIOINFORMATICS

[4]

[Anonymous], ACM SIGMOD INT C MAN

[5]

[Anonymous], IMAGE VISION COMPUT

[6]

[Anonymous], GEOGRAPHIC DATA MIN

[7]

[Anonymous], CLUSTERING CLASSIFIC

[8]

[Anonymous], 2013, THESIS TU EINDHOVEN

[9]

[Anonymous], 2007, ACM Transactions on Knowledge Discovery from Data, DOI DOI 10.1145/1217299.1217303

[10] Fast density clustering strategies based on the k-means algorithm [J].

Bai, Liang ;

Cheng, Xueqi ;

Liang, Jiye ;

Shen, Huawei ;

Guo, Yike .

PATTERN RECOGNITION, 2017, 71 :375-386

← 1 2 3 4 5 6 7 →