Neighborhood relevant outlier detection approach based on information entropy

Cited by: 4

Authors:

Yu, Qingying ^{[1,2]}

Luo, Yonglong ^{[1,2]}

Chen, Chuanming ^{[2]}

Bian, Weixin ^{[2]}

Affiliations:

[1] Anhui Normal Univ, Sch Territorial Resources & Tourism, 189 South Rd Jiuhua Rd, Wuhu 241003, Anhui, Peoples R China

[2] Anhui Normal Univ, Sch Math & Comp Sci, Wuhu, Anhui, Peoples R China

Source:

INTELLIGENT DATA ANALYSIS | 2016 / Vol. 20 / No. 06

Funding:

National Natural Science Foundation of China;

Keywords:

Outlier detection; information entropy; attribute weights; pruning; k-nearest neighborhood relevant outlier factor (kNNROF); DISTANCE-BASED OUTLIERS;

DOI:

10.3233/IDA-150301

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Outlier detection is an interesting issue in data mining and machine learning. In this paper, to detect outliers, an information-entropy-based k-nearest neighborhood relevant outlier factor algorithm is proposed that is combined with Shannon information theory and the triangle pruning strategy. The algorithm accounts for the data points whose k-nearest neighbors are distributed on the edge of the range within the designated radius. In particular, the neighborhood influence on each point is considered to address the problem of information concealment and submergence. Information entropy is used to calculate the weights to distinguish the importance of each attribute. Then, based on the attribute weights, the improved pruning strategy reduces the computational complexity of the subsequent procedures by removing some inliers and obtaining the outlier candidate dataset. Finally, according to the weighted distance between the objects in the candidate dataset and those in the original dataset, the algorithm calculates the dissimilarity between each object and its k-nearest neighbors. The data points with the top r dissimilarity are regarded as the outliers. Experimental results show that, compared
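The pipeline described in the abstract (entropy-derived attribute weights, then a weighted k-nearest-neighbor dissimilarity, then the top r points) can be illustrated with a rough sketch. This is not the paper's kNNROF formula, which the abstract does not give. The weighting below uses the generic "entropy weight method" over histogram bins, the dissimilarity is simply the mean weighted distance to the k nearest neighbors, and the pruning step is omitted. The function names, the `bins` parameter, and the direction of the weighting are all assumptions made for illustration.

```python
import numpy as np


def entropy_attribute_weights(X, bins=10):
    """Assumed weighting scheme (generic entropy weight method, not the paper's).

    Each attribute is discretized into `bins` histogram bins; attributes with
    lower normalized Shannon entropy receive a larger weight.
    """
    n, d = X.shape
    norm_entropy = np.empty(d)
    for j in range(d):
        counts, _ = np.histogram(X[:, j], bins=bins)
        p = counts[counts > 0] / n
        # Shannon entropy in bits, scaled to [0, 1] by the maximum log2(bins)
        norm_entropy[j] = -np.sum(p * np.log2(p)) / np.log2(bins)
    raw = 1.0 - norm_entropy
    if raw.sum() < 1e-12:
        # Every attribute is maximally spread out: fall back to uniform weights
        return np.full(d, 1.0 / d)
    return raw / raw.sum()


def top_r_outliers(X, k=3, r=1, bins=10):
    """Rank points by mean weighted distance to their k nearest neighbors.

    Returns the indices of the r highest-scoring points and all scores.
    The score is a stand-in for the paper's dissimilarity measure.
    """
    w = entropy_attribute_weights(X, bins=bins)
    diff = X[:, None, :] - X[None, :, :]
    # Weighted Euclidean distance between every pair of points
    dist = np.sqrt((w * diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    knn_dist = np.sort(dist, axis=1)[:, :k]
    score = knn_dist.mean(axis=1)
    return np.argsort(score)[::-1][:r], score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 20 points near the origin plus one point placed far away
    data = np.vstack([rng.normal(size=(20, 2)), [[10.0, 10.0]]])
    indices, scores = top_r_outliers(data, k=3, r=1)
    print(indices, scores[indices])
```

The brute-force pairwise distance matrix costs O(n^2) memory and time, which is the kind of cost the pruning strategy mentioned in the abstract is meant to reduce; it is kept here only for brevity.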


Pages: 1247-1265

Page count: 19
