An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

被引:0
作者
Xi-Te Wang
De-Rong Shen
Mei Bai
Tie-Zheng Nie
Yue Kou
Ge Yu
机构
[1] Northeastern University,College of Information Science and Engineering
来源
Journal of Computer Science and Technology | 2015年 / 30卷
关键词
outlier detection; multi-dimensional; distributed; large dataset;
D O I
暂无
中图分类号
学科分类号
摘要
The distance-based outlier is a widely used definition of outlier. A point is distinguished as an outlier on the basis of the distances to its nearest neighbors. In this paper, to solve the problem of outlier computing in distributed environments, DBOZ, a distributed algorithm for distance-based outlier detection using Z-curve hierarchical tree (ZH-tree) is proposed. First, we propose a new index, ZH-tree, to effectively manage the data in a distributed environment. ZH-tree has two desirable advantages, including clustering property to help search the neighbors of a point, and hierarchical structure to support space pruning. We also design a bottom-up approach to build ZH-tree in parallel, whose time complexity is linear to the number of dimensions and the size of dataset. Second, DBOZ is proposed to compute outliers in distributed environments. It consists of two stages. 1) To avoid calculating the exact nearest neighbors of all the points, we design a greedy method and a new ZH-tree based k-nearest neighbor searching algorithm (ZHkNN for short) to obtain a threshold LW. 2) We propose a filter-and-refine approach, which first filters out the unpromising points using LW, and then outputs the final outliers through refining the remaining points. At last, the efficiency and the effectiveness of ZH-tree and DBOZ are testified through a series of experiments.
引用
收藏
页码:1233 / 1248
页数:15
相关论文
共 46 条
  • [1] Ramaswamy S(2000)Efficient algorithms for mining outliers from large data sets ACM SIGMOD Record 29 427-438
  • [2] Rastogi R(2005)Outlier mining in large high dimensional data sets IEEE Transactions on Knowledge and Data Engineering 17 203-215
  • [3] Shim K(2014)Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection Data Mining and Knowledge Discovery 28 190-237
  • [4] Angiulli F(2015)Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs Proceedings of the VLDB Endowment 8 1178-1189
  • [5] Pizzuti C(2001)Outlier detection for high dimensional data ACM SIGMOD Record 30 37-46
  • [6] Schubert E(2000)LOF: Identifying density-based local outliers ACM SIGMOD Record 29 93-104
  • [7] Zimek A(2003)Discovering cluster-based local outliers Pattern Recognition Letters 24 1641-1650
  • [8] Kriegel HP(1984)R-trees: A dynamic index structure for spatial searching ACM SIGMOD Record 14 47-57
  • [9] Shiokawa H(2010)Outlier detection over sliding windows for probabilistic data streams Journal of Computer Science and Technology 25 389-400
  • [10] Fujiwara Y(2014)Continuous outlier monitoring on uncertain data streams Journal of Computer Science and Technology 29 436-448