Scalable Mining of Contextual Outliers Using Relevant Subspace

Cited by: 16
Authors
Zhang, Jifu [1]
Yu, Xiaolong [1]
Xun, Yaling [1]
Zhang, Sulan [1]
Qin, Xiao [2]
Affiliations
[1] Taiyuan Univ Sci & Technol, Taiyuan 030024, Peoples R China
[2] Auburn Univ, Samuel Ginn Coll Engn, Dept Comp Sci & Software Engn, Auburn, AL 36849 USA
Source
IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS | 2020, Vol. 50, No. 03
Funding
National Natural Science Foundation of China;
Keywords
Interpretability; local outlier factor; local sensitive hashing; relevant subspace; scalable mining; ALGORITHMS;
DOI
10.1109/TSMC.2017.2718592
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
In this paper, we propose a scalable mining algorithm that discovers contextual outliers using relevant subspaces. The algorithm is developed with the MapReduce programming model and runs on a Hadoop cluster. Relevant subspaces, which effectively capture the local distribution of various datasets, are quantified using the local sparseness of attribute dimensions. We design a novel way of calculating local outlier factors in a relevant subspace from the probability density of local datasets; this approach effectively reflects the outlier degree of a data object that does not satisfy the distribution of the local dataset in the relevant subspace. The attribute dimensions of a relevant subspace and the local outlier factors are expressed as vital contextual information, which improves the interpretability of outliers. The N data objects with the largest local outlier factor values are reported as contextual outliers. To this end, our scalable mining algorithm, which incorporates a locality-sensitive hashing distributed strategy, is implemented on a Hadoop cluster. Experimental results on both synthetic data and stellar spectral data validate the effectiveness, interpretability, scalability, and extensibility of the algorithm.
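The final selection step described in the abstract, reporting the N objects with the largest local outlier factor together with their contextual information, can be illustrated with a minimal sketch. This is not the authors' implementation: the `ScoredObject` record, the `top_n_outliers` helper, and the sample scores are hypothetical names and values chosen for illustration, and the outlier factors are assumed to have been computed already.

```python
import heapq
from typing import NamedTuple


class ScoredObject(NamedTuple):
    """Hypothetical record: one data object with its contextual information."""
    object_id: str
    outlier_factor: float   # assumed to be precomputed elsewhere
    subspace: tuple         # attribute dimensions of its relevant subspace


def top_n_outliers(scored, n):
    """Return the n objects with the largest outlier factor, largest first."""
    return heapq.nlargest(n, scored, key=lambda s: s.outlier_factor)


# Illustrative values only; they are not taken from the paper.
scored = [
    ScoredObject("a", 0.12, (0, 3)),
    ScoredObject("b", 2.75, (1,)),
    ScoredObject("c", 0.40, (0, 1, 2)),
    ScoredObject("d", 1.90, (2, 3)),
]

for obj in top_n_outliers(scored, 2):
    # The subspace dimensions travel with the score as context for the outlier.
    print(obj.object_id, obj.outlier_factor, obj.subspace)
```

Keeping the subspace dimensions alongside each score mirrors the abstract's point that both are reported as contextual information for every selected outlier.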
Pages: 988-1002
Number of pages: 15