Algorithm based on partition for outlier detection

被引:2
作者
School of Information Science and Engineering, Northeastern University, Shenyang 110006, China [1 ]
不详 [2 ]
机构
[1] School of Information Science and Engineering, Northeastern University
[2] School of Information and Control Engineering, Shenyang Jianzhu University
来源
Ruan Jian Xue Bao | 2006年 / 5卷 / 1009-1016期
关键词
CD-tree (cell dimension tree); Cell-based algorithm; Data mining; Outlier detection; Partition;
D O I
10.1360/jos171009
中图分类号
学科分类号
摘要
Outliers are objects that do not comply with the general behavior of the data. The method of partition divides data space into a set of non-overlapping rectangular cells by partitioning every dimension into equal length. Statistical information of cells is used to find knowledge in datasets. There exists very large data skew in real-life datasets, so partition will produce many empty cells, which affects the efficiency of the algorithms. An efficient index structure called CD-Tree (cell dimension tree) is designed for indexing cells. Moreover, to guide partition and to optimize the structure of CD-Tree, the concept of SOD (skew of data) is proposed to measure the degree of data skew. Finally, the CD-Tree-based algorithm is designed for outlier detection based on CD-Tree and SOD. The experimental results show that the efficiency of CD-Tree-based algorithm and the maximum number of dimensions processed increase obviously comparing with the Cell-based algorithm on real-life datasets.
引用
收藏
页码:1009 / 1016
页数:7
相关论文
共 8 条
[1]  
Knorr E., Ng R., Algorithms for mining distance-based outliers in large data sets, Proc. of the VLDB Conf, pp. 392-403, (1998)
[2]  
Knorr E., Ng R., Finding intensional knowledge of distance-based outliers, Proc. of the VLDB Conf., pp. 211-222, (1999)
[3]  
Ramaswamy S., Rastogi R., Shim K., Efficient algorithms for mining outliers from large data sets, Proc. of the ACM SIGMOD Conf., pp. 427-438, (2000)
[4]  
Breunig M.M., Kriegel H.P., Ng R., Sander J., LOF: Identifying density-based local outliers, Proc. of the ACM SIGMOD Conf., pp. 94-104, (2000)
[5]  
Arning A., Agrawal R., Raghavan P., A linear method for deviation detection in large databases, Proc. of the KDD Conf., pp. 164-169, (1996)
[6]  
Beckmann N., Kriegel H.P., Schneider R., Seeger B., The R*-tree: An efficient and robust access method for points and rectangles, Proc. of the ACM SIGMOD Conf., pp. 322-331, (1990)
[7]  
Katayama N., Satoh S., The SR-tree: An index structure for high-dimensional nearest neighbor queries, Proc. of the ACM SIGMOD Conf., pp. 369-380, (1997)
[8]  
Berchtold S., Keim D.A., Kriegel H., An index structure for high-dimensional data, Proc. of the 22nd VLDB Conf., pp. 28-39, (1996)