A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data

被引:114
作者
Chen, Yewang [1 ,2 ]
Tang, Shengyu [1 ]
Bouguila, Nizar [2 ]
Wang, Cheng [1 ]
Du, Jixiang [1 ]
Li, HaiLin [1 ]
机构
[1] Huaqiao Univ, Coll Comp Sci & Technol, Jimei Ave 668, Xiamen, Fujian, Peoples R China
[2] Concordia Inst Informat Syst Engn, 1515 St,Catherine St West,EV007-632, Montreal, PQ H3G 2W1, Canada
基金
美国国家科学基金会;
关键词
DBSCAN; rho-Approximate DBSCAN; NQ-DBSCAN; NEAREST-NEIGHBOR; DENSITY; SEGMENTATION; QUANTIZATION; CHAMELEON; MODEL;
D O I
10.1016/j.patcog.2018.05.030
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Clustering is an important technique to deal with large scale data which are explosively created in internet. Most data are high-dimensional with a lot of noise, which brings great challenges to retrieval, classification and understanding. No current existing approach is "optimal" for large scale data. For example, DBSCAN requires O(n(2)) time, Fast-DBSCAN only works well in 2 dimensions, and rho-Approximate DBSCAN runs in O(n) expected time which needs dimension D to be a relative small constant for the linear running time to hold. However, we prove theoretically and experimentally that p-Approximate DBSCAN degenerates to an O(n(2)) algorithm in very high dimension such that 2(D) > > n. In this paper, we propose a novel local neighborhood searching technique, and apply it to improve DBSCAN, named as NQ-DBSCAN, such that a large number of unnecessary distance computations can be effectively reduced. Theoretical analysis and experimental results show that NQ-DBSCAN averagely runs in O(n*log(n)) with the help of indexing technique, and the best case is O(n) if proper parameters are used, which makes it suitable for many realtime data. (C) 2018 Elsevier Ltd. All rights reserved.
引用
收藏
页码:375 / 387
页数:13
相关论文
共 63 条
  • [1] Ananthanarayanan V, 2015, ADV EDUC TECHNOL INS, P1, DOI 10.4018/978-1-4666-6461-6.ch001
  • [2] Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
    Andoni, Alexandr
    Indyk, Piotr
    [J]. COMMUNICATIONS OF THE ACM, 2008, 51 (01) : 117 - 122
  • [3] [Anonymous], BIOINFORMATICS
  • [4] [Anonymous], ACM SIGMOD INT C MAN
  • [5] [Anonymous], IMAGE VISION COMPUT
  • [6] [Anonymous], GEOGRAPHIC DATA MIN
  • [7] [Anonymous], CLUSTERING CLASSIFIC
  • [8] [Anonymous], 2013, THESIS TU EINDHOVEN
  • [9] [Anonymous], 2007, ACM Transactions on Knowledge Discovery from Data, DOI DOI 10.1145/1217299.1217303
  • [10] Fast density clustering strategies based on the k-means algorithm
    Bai, Liang
    Cheng, Xueqi
    Liang, Jiye
    Shen, Huawei
    Guo, Yike
    [J]. PATTERN RECOGNITION, 2017, 71 : 375 - 386