Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data

Authors
Anna Koufakou
Jimmy Secretan
Michael Georgiopoulos
Affiliations
[1] Florida Gulf Coast University, U.A. Whitaker School of Engineering
[2] University of Central Florida, School of Electrical Engineering and Computer Science
Source
Knowledge and Information Systems | 2011 / Vol. 29
Keywords
Outlier detection; Anomaly detection; Frequent itemset mining; Non-derivable itemsets; Categorical datasets
DOI
Not available
Abstract
Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pairwise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in speed and scalability over FI-based outlier detection methods. Significantly, NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs, is shown to exhibit accuracy similar to that of the NDI-based method for various δ values, along with further runtime gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values.
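The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea of scoring categorical records from an itemset collection. It assumes a frequent-pattern-style score (the summed support of the frequent itemsets a record contains, divided by the number of frequent itemsets, with lower values read as more anomalous) and uses brute-force itemset counting. The function names, the `max_len` cap, and the toy data are hypothetical; the paper's methods would presumably substitute the smaller NDI or NADI collection for the full FI collection, which this sketch does not attempt.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(rows, min_support, max_len=2):
    """Brute-force: count every itemset of size <= max_len and keep those
    whose relative support is at least min_support (assumed threshold)."""
    n = len(rows)
    counts = Counter()
    for row in rows:
        items = sorted(row)
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def anomaly_scores(rows, itemsets):
    """Assumed score: summed support of the frequent itemsets contained in
    each row, normalised by the number of itemsets. Lower = more anomalous."""
    total = len(itemsets) or 1
    return [
        sum(sup for s, sup in itemsets.items() if s <= row) / total
        for row in rows
    ]


# Toy categorical data: each record is a set of (attribute, value) pairs.
rows = [
    {("color", "red"), ("shape", "round")},
    {("color", "red"), ("shape", "round")},
    {("color", "red"), ("shape", "round")},
    {("color", "red"), ("shape", "round")},
    {("color", "blue"), ("shape", "square")},
]

itemsets = frequent_itemsets(rows, min_support=0.4)
scores = anomaly_scores(rows, itemsets)
outlier_index = scores.index(min(scores))
```

In this toy data the last record contains no frequent itemset, so it receives the lowest score. Enumerating all itemsets grows quickly with the number of attributes, which is the cost the condensed NDI representation is meant to reduce.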
Pages: 697-725
Page count: 28