Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection

Cited by: 170
Authors
Radovanovic, Milos [1]
Nanopoulos, Alexandros [2]
Ivanovic, Mirjana [3]
Affiliations
[1] Univ Novi Sad, Fac Sci, Dept Math & Informat, Novi Sad 21000, Serbia
[2] Univ Eichstaett Ingolstadt, Ingolstadt Sch Management, Ingolstadt, Germany
[3] Univ Novi Sad, Fac Sci, Novi Sad 21000, Serbia
Keywords
Outlier detection; reverse nearest neighbors; high-dimensional data; distance concentration; algorithms
DOI
10.1109/TKDE.2014.2365790
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Outlier detection in high-dimensional data presents various challenges resulting from the "curse of dimensionality." A prevailing view is that distance concentration, i.e., the tendency of distances in high-dimensional data to become indiscernible, hinders the detection of outliers by making distance-based methods label all points as almost equally good outliers. In this paper, we provide evidence supporting the opinion that such a view is too simple, by demonstrating that distance-based methods can produce more contrasting outlier scores in high-dimensional settings. Furthermore, we show that high dimensionality can have a different impact, by reexamining the notion of reverse nearest neighbors in the unsupervised outlier-detection context. Namely, it was recently observed that the distribution of points' reverse-neighbor counts becomes skewed in high dimensions, resulting in the phenomenon known as hubness. We provide insight into how some points (antihubs) appear very infrequently in k-NN lists of other points, and explain the connection between antihubs, outliers, and existing unsupervised outlier-detection methods. By evaluating the classic k-NN method, the angle-based technique designed for high-dimensional data, the density-based local outlier factor and influenced outlierness methods, and antihub-based methods on various synthetic and real-world data sets, we offer novel insight into the usefulness of reverse neighbor counts in unsupervised outlier detection.
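The abstract's central idea is to score points by their reverse-nearest-neighbor counts: a point x that appears in very few other points' k-NN lists (a low N_k(x), i.e., an antihub) is treated as more outlier-like. The sketch below is only an illustration of that idea, assuming scikit-learn's NearestNeighbors for the k-NN search; the function name antihub_scores and the simple inverse-count score are this sketch's own simplifications, not the paper's AntiHub or AntiHub2 definitions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def antihub_scores(X, k=10):
    """Score each point by how rarely it appears in other points'
    k-nearest-neighbor lists: a low reverse-neighbor count N_k(x)
    marks an antihub, which this sketch treats as more outlier-like."""
    n = X.shape[0]
    # Ask for k + 1 neighbors because, on the training set, each point
    # is returned as its own nearest neighbor (distance 0).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    # Reverse-neighbor counts N_k(x): how often each point occurs in the
    # k-NN lists of the other points (self-neighbors in column 0 excluded).
    counts = np.bincount(idx[:, 1:].ravel(), minlength=n)
    # Higher score = stronger outlier candidate, so invert the counts.
    return 1.0 / (1.0 + counts.astype(float))

# Usage: rank points of a random high-dimensional sample by antihub-style score.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 50))
    scores = antihub_scores(X, k=10)
    print("top-5 outlier candidates:", np.argsort(-scores)[:5])
```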
Pages: 1369-1382
Number of pages: 14