Efficient Outlier Detection for High-Dimensional Data

被引:55
作者
Liu, Huawen [1 ]
Li, Xuelong [2 ,3 ]
Li, Jiuyong [4 ]
Zhang, Shichao [5 ]
机构
[1] Zhejiang Normal Univ, Dept Comp Sci, Jinhua 321004, Peoples R China
[2] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Shaanxi, Peoples R China
[3] Northwestern Polytech Univ, Ctr OPT IMagery Anal & Learning OPTIMAL, Xian 710072, Shaanxi, Peoples R China
[4] Univ South Australia, Sch Informat Technol & Math Sci, Adelaide, SA 5095, Australia
[5] Guangxi Normal Univ, Dept Comp Sci, Guilin 541004, Peoples R China
来源
IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS | 2018年 / 48卷 / 12期
基金
中国国家自然科学基金;
关键词
Dimension reduction; high-dimensional data; k nearest neighbors (kNN); low-rank approximation; outlier detection; RANK;
D O I
10.1109/TSMC.2017.2718220
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
How to tackle high dimensionality of data effectively and efficiently is still a challenging issue in machine learning. Identifying anomalous objects from given data has a broad range of real-world applications. Although many classical outlier detection or ranking algorithms have been witnessed during the past years, the high-dimensional problem, as well as the size of neighborhood, in outlier detection have not yet attracted sufficient attention. The former may trigger the distance concentration problem that the distances of observations in high-dimensional space tend to be indiscernible, whereas the latter requires appropriate values for parameters, making models high complex and more sensitive. To partially circumvent these problems, especially the high dimensionality, we introduce a concept called local projection score (LPS) to represent deviation degree of an observation to its neighbors. The LPS is obtained from the neighborhood information by the technique of low-rank approximation. The observation with high LPS is a promising candidate of outlier in high probability. Based on this notion, we propose an efficient and effective outlier detection algorithm, which is also robust to the parameter k of k nearest neighbors. Extensive evaluation experiments conducted on twelve public real-world data sets with five popular outlier detection algorithms show that the performance of the proposed method is competitive and promising.
引用
收藏
页码:2451 / 2461
页数:11
相关论文
共 38 条
[1]   Anomaly Detection Using Model Generation for Event-Based Systems Without a Preexisting Formal Model [J].
Allen, Lindsay V. ;
Tilbury, Dawn M. .
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART A-SYSTEMS AND HUMANS, 2012, 42 (03) :654-668
[2]  
Angiulli F., 2002, Principles of Data Mining and Knowledge Discovery. 6th European Conference, PKDD 2002. Proceedings (Lecture Notes in Artificial Intelligence Vol.2431), P15
[3]  
[Anonymous], 2013, Outlier Analysis, DOI [DOI 10.1007/978-1-4614-6396-2, 10.1007/978-1-4614-6396-2]
[4]   An optimal algorithm for approximate nearest neighbor searching in fixed dimensions [J].
Arya, S ;
Mount, DM ;
Netanyahu, NS ;
Silverman, R ;
Wu, AY .
JOURNAL OF THE ACM, 1998, 45 (06) :891-923
[5]  
Banerjee S., 2014, LINEAR ALGEBRA MATRI
[7]   LOF: Identifying density-based local outliers [J].
Breunig, MM ;
Kriegel, HP ;
Ng, RT ;
Sander, J .
SIGMOD RECORD, 2000, 29 (02) :93-104
[8]  
Eskin E., 2000, In Proceedings of the International Conference on Machine Learning, P255, DOI DOI 10.1109/ICCSA.2008.70
[9]   Outlier Detection for Temporal Data: A Survey [J].
Gupta, Manish ;
Gao, Jing ;
Aggarwal, Charu C. ;
Han, Jiawei .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (09) :2250-2267
[10]   A precise ranking method for outlier detection [J].
Ha, Jihyun ;
Seok, Seulgi ;
Lee, Jong-Seok .
INFORMATION SCIENCES, 2015, 324 :88-107