Algorithm based on partition for outlier detection

被引：2

作者：

School of Information Science and Engineering, Northeastern University, Shenyang 110006, China ^{[1
]}

不详 ^{[2
]}

机构：

[1] School of Information Science and Engineering, Northeastern University

[2] School of Information and Control Engineering, Shenyang Jianzhu University

来源：

Ruan Jian Xue Bao | 2006年 / 5卷 / 1009-1016期

关键词：

CD-tree (cell dimension tree); Cell-based algorithm; Data mining; Outlier detection; Partition;

D O I：

10.1360/jos171009

中图分类号：

学科分类号：

摘要：

Outliers are objects that do not comply with the general behavior of the data. The method of partition divides data space into a set of non-overlapping rectangular cells by partitioning every dimension into equal length. Statistical information of cells is used to find knowledge in datasets. There exists very large data skew in real-life datasets, so partition will produce many empty cells, which affects the efficiency of the algorithms. An efficient index structure called CD-Tree (cell dimension tree) is designed for indexing cells. Moreover, to guide partition and to optimize the structure of CD-Tree, the concept of SOD (skew of data) is proposed to measure the degree of data skew. Finally, the CD-Tree-based algorithm is designed for outlier detection based on CD-Tree and SOD. The experimental results show that the efficiency of CD-Tree-based algorithm and the maximum number of dimensions processed increase obviously comparing with the Cell-based algorithm on real-life datasets.

引用

页码：1009 / 1016

页数：7

共 8 条

[1]

Knorr E., Ng R., Algorithms for mining distance-based outliers in large data sets, Proc. of the VLDB Conf, pp. 392-403, (1998)

[2]

Knorr E., Ng R., Finding intensional knowledge of distance-based outliers, Proc. of the VLDB Conf., pp. 211-222, (1999)

[3]

Ramaswamy S., Rastogi R., Shim K., Efficient algorithms for mining outliers from large data sets, Proc. of the ACM SIGMOD Conf., pp. 427-438, (2000)

[4]

Breunig M.M., Kriegel H.P., Ng R., Sander J., LOF: Identifying density-based local outliers, Proc. of the ACM SIGMOD Conf., pp. 94-104, (2000)

[5]

Arning A., Agrawal R., Raghavan P., A linear method for deviation detection in large databases, Proc. of the KDD Conf., pp. 164-169, (1996)

[6]

Beckmann N., Kriegel H.P., Schneider R., Seeger B., The R*-tree: An efficient and robust access method for points and rectangles, Proc. of the ACM SIGMOD Conf., pp. 322-331, (1990)

[7]

Katayama N., Satoh S., The SR-tree: An index structure for high-dimensional nearest neighbor queries, Proc. of the ACM SIGMOD Conf., pp. 369-380, (1997)

[8]

Berchtold S., Keim D.A., Kriegel H., An index structure for high-dimensional data, Proc. of the 22nd VLDB Conf., pp. 28-39, (1996)

← 1 →