Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data

Cited by: 1
Authors
Anna Koufakou
Jimmy Secretan
Michael Georgiopoulos
Affiliations
[1] Florida Gulf Coast University, U.A. Whitaker School of Engineering
[2] University of Central Florida, School of Electrical Engineering and Computer Science
Keywords
Outlier detection; Anomaly detection; Frequent itemset mining; Non-Derivable itemsets; Categorical datasets;
DOI
Not available
Abstract
Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pair-wise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection in order to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in speed and scalability over the FI-based outlier detection methods. Significantly, NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs, is shown to exhibit accuracy similar to that of the NDI-based method for various δ values, along with further runtime performance gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values.
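The abstract does not state the scoring formulas used by FNDI-OD or NBNDI-OD, so the following is only a minimal sketch of the general idea of scoring categorical records from a collection of frequent itemsets. It assumes a generic score (the average support of the frequent itemsets contained in a record, where lower means more anomalous) and a brute-force itemset enumeration. The function names `frequent_itemsets` and `anomaly_scores`, the parameters `min_support` and `max_len`, and the toy records are all hypothetical and not taken from the paper.

```python
from itertools import combinations


def frequent_itemsets(records, min_support, max_len):
    """Brute-force enumeration of frequent itemsets (illustrative only).

    records: list of sets of (attribute, value) pairs.
    Returns a dict mapping frozenset -> support as a fraction of records.
    """
    n = len(records)
    counts = {}
    for rec in records:
        items = sorted(rec)
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def anomaly_scores(records, itemsets):
    """Assumed generic score: average support of the itemsets a record contains.

    Records that contain few frequent itemsets get a low score and are
    treated as more anomalous. This is not the paper's formula.
    """
    total = len(itemsets)
    if total == 0:
        return [0.0 for _ in records]
    return [
        sum(sup for iset, sup in itemsets.items() if iset <= rec) / total
        for rec in records
    ]


# Toy data: three similar records and one that shares no frequent itemset.
records = [
    {("color", "red"), ("shape", "round")},
    {("color", "red"), ("shape", "round")},
    {("color", "red"), ("shape", "round")},
    {("color", "blue"), ("shape", "square")},
]
fis = frequent_itemsets(records, min_support=0.5, max_len=2)
scores = anomaly_scores(records, fis)
outlier_index = min(range(len(records)), key=lambda i: scores[i])
```

In this sketch the cost of scoring grows with the size of `itemsets`, which is the bottleneck the abstract describes. Under the same assumptions, a condensed collection such as the frequent NDIs could be passed in place of `fis` without changing `anomaly_scores`.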
Pages: 697 - 725
Page count: 28
Source
Knowledge and Information Systems, 2011, 29 (03): 697 - 725