Computationally Efficient Outlier Detection for High-Dimensional Data Using the MDP Algorithm

Cited by: 0
Authors
Tsagris, Michail [1]
Papadakis, Manos [2]
Alenazi, Abdulaziz [2]
Alzeley, Omar [3]
Affiliations
[1] Univ Crete, Dept Econ, Gallos Campus, Rethimnon 74100, Greece
[2] Northern Border Univ, Coll Sci, Dept Math, Ar Ar 73213, Saudi Arabia
[3] Umm Al Qura Univ, Al Qunfudah Univ Coll, Dept Math, Mecca 24382, Saudi Arabia
Keywords
high-dimensional data; outliers; computational efficiency; 6208;
DOI
10.3390/computation12090185
Chinese Library Classification (CLC) number
O1 [Mathematics];
Discipline classification code
0701 ; 070101 ;
Abstract
Outlier detection, or anomaly detection as it is known in the machine learning community, has gained interest in recent years, and it is commonly used when the sample size is smaller than the number of variables. In 2015, an outlier detection procedure was proposed for this high-dimensional setting, replacing the classic minimum covariance determinant estimator with the minimum diagonal product estimator. Computationally speaking, that method has two drawbacks: (a) it is not computationally efficient and does not scale up, and (b) it is not memory efficient and, in some cases, cannot be applied at all because of memory limits. We address the first issue via efficient code written in both R and C++, whereas for the second issue, we utilize the eigendecomposition and its properties. Experiments are conducted using simulated data to showcase the time improvement, while gene expression data are used to further examine some extra practicalities associated with the algorithm. The simulation studies yield a speed-up factor that ranges between 17 and 1800, implying a successful reduction in the estimator's computational burden.
Pages: 10
Related papers
50 in total
  • [31] A survey on unsupervised subspace outlier detection methods for high dimensional data
    Ahn, Jaehyeong
    Kwon, Sunghoon
    KOREAN JOURNAL OF APPLIED STATISTICS, 2021, 34 (03) : 507 - 521
  • [32] The xyz algorithm for fast interaction search in high-dimensional data
    Thanei, Gian-Andrea
    Meinshausen, Nicolai
    Shah, Rajen D.
    JOURNAL OF MACHINE LEARNING RESEARCH, 2018, 19
  • [33] Persistent homology based clustering algorithm for high-dimensional data
    Xiong Z.
    Wei Y.
    Xiong Z.
    He K.
    Huazhong Keji Daxue Xuebao (Ziran Kexue Ban)/Journal of Huazhong University of Science and Technology (Natural Science Edition), 2024, 52 (02) : 29 - 35
  • [34] A Clustering Algorithm for High-Dimensional Nonlinear Feature Data with Applications
    Jiang H.
    Wang G.
    Gao J.
    Gao Z.
    Gao R.
    Guo Q.
    Hsi-An Chiao Tung Ta Hsueh/Journal of Xi'an Jiaotong University, 2017, 51 (12) : 49 - 55 and 90
  • [35] Contextual anomaly detection for high-dimensional data using Dirichlet process variational autoencoder
    Kim, Hyojoong
    Kim, Heeyoung
    IISE TRANSACTIONS, 2023, 55 (05) : 433 - 444
  • [36] Outlier-resistant high-dimensional regression modelling based on distribution-free outlier detection and tuning parameter selection
    Park, Heewon
    JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2017, 87 (09) : 1799 - 1812
  • [37] Outlier Robust Geodesic K-means Algorithm for High Dimensional Data
    Hassanzadeh, Aidin
    Kaarna, Arto
    Kauranne, Tuomo
    STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, S+SSPR 2016, 2016, 10029 : 252 - 262
  • [38] Forecasting the Japanese macroeconomy using high-dimensional data
    Nakajima, Yoshiki
    Sueishi, Naoya
    JAPANESE ECONOMIC REVIEW, 2022, 73 (02) : 299 - 324
  • [39] Similarity joins for high-dimensional data using Spark
    Rong, Chuitian
    Cheng, Xiaohai
    Chen, Ziliang
    Huo, Na
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2019, 31 (20)
  • [40] High-dimensional data monitoring using support machines
    Maboudou-Tchao, Edgard M.
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2021, 50 (07) : 1927 - 1942