An Efficient Approach for Outlier Detection with Imperfect Data Labels

Cited by: 44

Authors:

Liu, Bo [1]

Xiao, Yanshan [2]

Yu, Philip S. [3,4]

Hao, Zhifeng [2]

Cao, Longbing [5]

Affiliations:

[1] Guangdong Univ Technol, Dept Automat, Guangzhou 510006, Guangdong, Peoples R China

[2] Guangdong Univ Technol, Dept Comp Sci, Guangzhou 510006, Guangdong, Peoples R China

[3] Univ Illinois, Dept Comp Sci, Chicago, IL 60607 USA

[4] King Abdulaziz Univ, Dept Comp Sci, Jeddah, Saudi Arabia

[5] Univ Technol Sydney, Fac Engn & Informat Technol, Sydney, NSW 2007, Australia

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2014 / Vol. 26 / No. 07

Funding:

National Science Foundation (USA); Australian Research Council;

Keywords:

Outlier detection; data of uncertainty; SUPPORT VECTOR MACHINES; DISTANCE-BASED OUTLIERS; CLASSIFICATION;

DOI:

10.1109/TKDE.2013.108

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The task of outlier detection is to identify data objects that are markedly different from or inconsistent with the normal set of data. Most existing solutions build a model using the normal data and identify as outliers the objects that do not fit that model well. However, in many applications a limited number of negative examples (outliers) is available in addition to the normal data, and the data may be corrupted such that the outlier detection data is imperfectly labeled. These factors make outlier detection far more difficult than in the traditional setting. This paper presents a novel outlier detection approach that addresses data with imperfect labels and incorporates limited abnormal examples into learning. To deal with data with imperfect labels, we introduce likelihood values for each input example, which denote its degree of membership toward the normal and abnormal classes respectively. Our proposed approach works in two steps. In the first step, we generate a pseudo training dataset by computing likelihood values of each example based on its local behavior. We present a kernel k-means clustering method and a kernel LOF-based method to compute the likelihood values. In the second step, we incorporate the generated likelihood values and limited abnormal examples into an SVDD-based learning framework to build a more accurate classifier for global outlier detection. By integrating local and global outlier detection, our proposed method explicitly handles data with imperfect labels and enhances the performance of outlier detection. Extensive experiments on real-life datasets have demonstrated that our proposed approaches can achieve a better tradeoff between detection rate and false alarm rate than state-of-the-art outlier detection approaches.
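
The first step described in the abstract, turning an example's local behavior into a soft membership value, can be illustrated with a small sketch. This is not the paper's method: the record gives neither the kernel k-means nor the kernel LOF formulas, so the code below assumes a one-cluster simplification in which each example's squared RBF-kernel distance to the kernel mean of the data is rescaled to a value in [0, 1]. The function names, the `gamma` value, the rescaling rule, and the toy points are all hypothetical.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def kernel_distances_to_mean(points, gamma=0.5):
    """Squared feature-space distance of each point to the kernel mean.

    Uses ||phi(x_i) - m||^2 = k(x_i, x_i) - (2/n) * sum_j k(x_i, x_j)
                              + (1/n^2) * sum_{j,l} k(x_j, x_l).
    """
    n = len(points)
    gram = [[rbf_kernel(p, q, gamma) for q in points] for p in points]
    mean_term = sum(sum(row) for row in gram) / (n * n)
    return [gram[i][i] - 2.0 * sum(gram[i]) / n + mean_term for i in range(n)]


def normal_likelihoods(points, gamma=0.5):
    """Assumed rescaling: likelihood of 'normal' = 1 - d / max(d).

    The membership toward the abnormal class would then be 1 minus this value.
    """
    dists = kernel_distances_to_mean(points, gamma)
    d_max = max(dists)
    if d_max <= 0.0:
        # All points coincide in feature space; treat them all as normal.
        return [1.0 for _ in dists]
    return [1.0 - d / d_max for d in dists]


# Toy data: four tightly grouped points and one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
likelihoods = normal_likelihoods(points)
```

In this toy run the distant point receives the lowest likelihood of being normal, while the grouped points receive values close to 1. In the paper's second step, such values would be passed, together with the limited abnormal examples, to an SVDD-based learner; that step is not sketched here.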


Pages: 1602-1616

Number of pages: 15
