Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data

Cited by: 1

Authors:

Anna Koufakou

Jimmy Secretan

Michael Georgiopoulos

Affiliations:

[1] Florida Gulf Coast University, U.A. Whitaker School of Engineering

[2] University of Central Florida, School of Electrical Engineering and Computer Science

Source:

Knowledge and Information Systems | 2011 / Volume 29

Keywords:

Outlier detection; Anomaly detection; Frequent itemset mining; Non-derivable itemsets; Categorical datasets

DOI:

Not available

Abstract:

Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pair-wise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection in order to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in terms of speed and scalability over FI-based outlier detection methods. Significantly, NDI-based methods exhibit detection accuracy similar to or better than that of the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs, is shown to exhibit accuracy similar to that of the NDI-based method for various δ values, together with further runtime performance gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values.
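To make the idea of an itemset-based anomaly score concrete, the following is a minimal Python sketch of the general FI-based approach the abstract refers to. It is an illustration only, not the authors' implementation: the function names `frequent_itemsets` and `anomaly_scores`, the brute-force mining step, the toy records, and the scoring rule (sum of the supports of the frequent itemsets a record contains, where a lower score means more unusual) are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(records, min_support, max_len):
    """Brute-force miner (illustrative only): count every itemset of up to
    max_len items and keep those whose relative support >= min_support."""
    n = len(records)
    counts = Counter()
    for rec in records:
        items = sorted(rec)
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def anomaly_scores(records, itemsets):
    """Assumed scoring rule: sum the supports of all frequent itemsets
    contained in the record. Records covered by few frequent itemsets
    receive low scores and are treated as outlier candidates."""
    return [sum(sup for s, sup in itemsets.items() if s <= rec)
            for rec in records]


# Toy categorical data: each record is a set of (attribute, value) items.
records = [
    frozenset({("color", "red"), ("shape", "round")}),
    frozenset({("color", "red"), ("shape", "round")}),
    frozenset({("color", "red"), ("shape", "round")}),
    frozenset({("color", "red"), ("shape", "round")}),
    frozenset({("color", "blue"), ("shape", "square")}),
]

fis = frequent_itemsets(records, min_support=0.5, max_len=2)
scores = anomaly_scores(records, fis)
outlier_index = scores.index(min(scores))
```

In this sketch the cost of scoring grows with the size of the `itemsets` collection. Under the same assumptions, passing a condensed collection such as the frequent NDIs in place of the full FI collection would leave the scoring loop unchanged while giving it fewer sets to check, which is the kind of saving the abstract describes.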

Citation

Pages: 697-725

Page count: 28

50 related records in total

[1] Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data
Koufakou, Anna
Secretan, Jimmy
Georgiopoulos, Michael
KNOWLEDGE AND INFORMATION SYSTEMS, 2011, 29 (03) : 697 - 725
[2] Mining non-derivable frequent itemsets over data stream
Li, Haifeng
Chen, Hong
DATA & KNOWLEDGE ENGINEERING, 2009, 68 (05) : 481 - 498
[3] Fast outlier detection algorithm for high dimensional categorical data streams
Zhou, Xiao-Yun
Sun, Zhi-Hui
Zhang, Bai-Li
Yang, Yi-Dong
Ruan Jian Xue Bao/Journal of Software, 2007, 18 (04) : 933 - 942
[4] Weighted Outlier Detection of High-Dimensional Categorical Data Using Feature Grouping
Li, Junli
Zhang, Jifu
Pang, Ning
Qin, Xiao
IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2020, 50 (11) : 4295 - 4308
[5] Outlier detection for high-dimensional data
Ro, Kwangil
Zou, Changliang
Wang, Zhaojun
Yin, Guosheng
BIOMETRIKA, 2015, 102 (03) : 589 - 599
[6] Fast outlier detection for high-dimensional data of wireless sensor networks
Qiao, Yan
Cui, Xinhong
Jin, Peng
Zhang, Wu
INTERNATIONAL JOURNAL OF DISTRIBUTED SENSOR NETWORKS, 2020, 16 (10)
[7] Intrinsic dimensional outlier detection in high-dimensional data
Von Brünken, Jonathan
Houle, Michael E.
Zimek, Arthur
NII Technical Reports, 2015, (03) : 1 - 12
[8] Efficient Outlier Detection for High-Dimensional Data
Liu, Huawen
Li, Xuelong
Li, Jiuyong
Zhang, Shichao
IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2018, 48 (12) : 2451 - 2461
[9] Outlier mining in large high-dimensional data sets
Angiulli, F
Pizzuti, C
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2005, 17 (02) : 203 - 215
[10] A geometric framework for outlier detection in high-dimensional data
Herrmann, Moritz
Pfisterer, Florian
Scheipl, Fabian
WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY, 2023, 13 (03)