Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

被引:454
作者
Campello, Ricardo J. G. B. [1 ]
Moulavi, Davoud [2 ]
Zimek, Arthur [3 ]
Sander, Joerg [4 ]
机构
[1] Univ Sao Paulo, Dept Comp Sci, BR-05508 Sao Paulo, Brazil
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2M7, Canada
[3] Univ Munich, D-81377 Munich, Germany
[4] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2M7, Canada
基金
巴西圣保罗研究基金会; 加拿大自然科学与工程研究理事会;
关键词
Data Mining; Clustering; Algorithms; Density-based clustering; hierarchical and nonhierarchical clustering; unsupervised and semisupervised clustering; data visualization; outlier detection; global/local outliers; DISTANCE-BASED OUTLIERS; ALGORITHMS; PROJECTIONS; EXTRACTION; EFFICIENCY; FEATURES; NUMBER; TREE;
D O I
10.1145/2733381
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
An integrated framework for density-based cluster analysis, outlier detection, and data visualization is introduced in this article. The main module consists of an algorithm to compute hierarchical estimates of the level sets of a density, following Hartigan's classic model of density-contour clusters and trees. Such an algorithm generalizes and improves existing density-based clustering techniques with respect to different aspects. It provides as a result a complete clustering hierarchy composed of all possible density-based clusters following the nonparametric model adopted, for an infinite range of density thresholds. The resulting hierarchy can be easily processed so as to provide multiple ways for data visualization and exploration. It can also be further postprocessed so that: (i) a normalized score of "outlierness" can be assigned to each data object, which unifies both the global and local perspectives of outliers into a single definition; and (ii) a "flat" (i.e., nonhierarchical) clustering solution composed of clusters extracted from local cuts through the cluster tree (possibly corresponding to different density thresholds) can be obtained, either in an unsupervised or in a semisupervised way. In the unsupervised scenario, the algorithm corresponding to this postprocessing module provides a global, optimal solution to the formal problem of maximizing the overall stability of the extracted clusters. If partially labeled objects or instance-level constraints are provided by the user, the algorithm can solve the problem by considering both constraints violations/satisfactions and cluster stability criteria. An asymptotic complexity analysis, both in terms of running time and memory space, is described. Experiments are reported that involve a variety of synthetic and real datasets, including comparisons with state-of-the-art, density-based clustering and (global and local) outlier detection methods.
引用
收藏
页数:51
相关论文
共 127 条
[1]  
Abe N., 2006, P 12 ACM SIGKDD INT, V2006, P504, DOI [10.1145/1150402.1150459, DOI 10.1145/1150402.1150459]
[2]  
Achtert E., 2013, Proceedings of the ACM SIGMOD International Conference on Management of Data, P1009, DOI DOI 10.1145/2463676.2463696
[3]  
Aggarwal CC, 2001, SIGMOD RECORD, V30, P37
[4]   A comprehensive survey of numeric and symbolic outlier mining techniques [J].
Agyemang, Malik ;
Barker, Ken ;
Alhajj, Rada .
INTELLIGENT DATA ANALYSIS, 2006, 10 (06) :521-538
[5]  
Angiulli F., 2002, Principles of Data Mining and Knowledge Discovery. 6th European Conference, PKDD 2002. Proceedings (Lecture Notes in Artificial Intelligence Vol.2431), P15
[6]   DOLPHIN: An Efficient Algorithm for Mining Distance-Based Outliers in Very Large Datasets [J].
Angiulli, Fabrizio ;
Fassetti, Fabio .
ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2009, 3 (01)
[7]  
Ankerst M, 1999, SIGMOD RECORD, VOL 28, NO 2 - JUNE 1999, P49
[8]  
[Anonymous], NUMERICAL TAXONOMY
[9]  
[Anonymous], 1996, SIGMOD REC ACM SPEC, DOI DOI 10.1145/235968.233324
[10]  
[Anonymous], CONSTRAINT CLUSTERIN