Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection

Cited by: 170
Authors
Radovanovic, Milos [1]
Nanopoulos, Alexandros [2]
Ivanovic, Mirjana [3]
Affiliations
[1] Univ Novi Sad, Fac Sci, Dept Math & Informat, Novi Sad 21000, Serbia
[2] Univ Eichstaett Ingolstadt, Ingolstadt Sch Management, Ingolstadt, Germany
[3] Univ Novi Sad, Fac Sci, Novi Sad 21000, Serbia
Keywords
Outlier detection; reverse nearest neighbors; high-dimensional data; distance concentration; algorithms
DOI
10.1109/TKDE.2014.2365790
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Outlier detection in high-dimensional data presents various challenges resulting from the "curse of dimensionality." A prevailing view is that distance concentration, i.e., the tendency of distances in high-dimensional data to become indiscernible, hinders the detection of outliers by making distance-based methods label all points as almost equally good outliers. In this paper, we provide evidence supporting the opinion that such a view is too simple, by demonstrating that distance-based methods can produce more contrasting outlier scores in high-dimensional settings. Furthermore, we show that high dimensionality can have a different impact, by reexamining the notion of reverse nearest neighbors in the unsupervised outlier-detection context. Namely, it was recently observed that the distribution of points' reverse-neighbor counts becomes skewed in high dimensions, resulting in the phenomenon known as hubness. We provide insight into how some points (antihubs) appear very infrequently in k-NN lists of other points, and explain the connection between antihubs, outliers, and existing unsupervised outlier-detection methods. By evaluating the classic k-NN method, the angle-based technique designed for high-dimensional data, the density-based local outlier factor and influenced outlierness methods, and antihub-based methods on various synthetic and real-world data sets, we offer novel insight into the usefulness of reverse neighbor counts in unsupervised outlier detection.
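The abstract's central idea is to score points by their reverse-nearest-neighbor counts: a point x that appears in very few other points' k-NN lists (a low N_k(x), i.e., an antihub) is treated as more outlier-like. The sketch below is only an illustration of that idea, assuming scikit-learn's NearestNeighbors for the k-NN search; the function name antihub_scores and the simple inverse-count score are this sketch's own simplifications, not the paper's AntiHub or AntiHub2 definitions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def antihub_scores(X, k=10):
    """Score each point by how rarely it appears in other points'
    k-nearest-neighbor lists: a low reverse-neighbor count N_k(x)
    marks an antihub, which this sketch treats as more outlier-like."""
    n = X.shape[0]
    # Ask for k + 1 neighbors because, on the training set, each point
    # is returned as its own nearest neighbor (distance 0).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    # Reverse-neighbor counts N_k(x): how often each point occurs in the
    # k-NN lists of the other points (self-neighbors in column 0 excluded).
    counts = np.bincount(idx[:, 1:].ravel(), minlength=n)
    # Higher score = stronger outlier candidate, so invert the counts.
    return 1.0 / (1.0 + counts.astype(float))

# Usage: rank points of a random high-dimensional sample by antihub-style score.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 50))
    scores = antihub_scores(X, k=10)
    print("top-5 outlier candidates:", np.argsort(-scores)[:5])
```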
Pages: 1369-1382
Number of pages: 14