Distance-based outlier detection for high dimension, low sample size data

Cited by: 13
Authors
Ahn, Jeongyoun [1]
Lee, Myung Hee [2]
Lee, Jung Ae [3]
Affiliations
[1] Univ Georgia, Dept Stat, Athens, GA 30602 USA
[2] Weill Cornell Med Coll, Dept Med, Ctr Global Hlth, New York, NY USA
[3] Univ Arkansas, Agr Stat Lab, Fayetteville, AR 72701 USA
Keywords
Centroid distance; HDLSS; high-dimensional asymptotics; maximal data piling distance; multiple outliers; GEOMETRIC REPRESENTATION; MODEL
DOI
10.1080/02664763.2018.1452901
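Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject classification code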
020208 ; 070103 ; 0714 ;
Abstract
Despite the popularity of high dimension, low sample size data analysis, the issue of sample integrity, in particular the possibility of outliers in the data, has not received enough attention. A new outlier detection procedure for data with much larger dimensionality than the sample size is presented. The proposed method is motivated by asymptotic properties of high-dimensional distance measures. Empirical studies suggest that high-dimensional outlier detection is more likely to suffer from a swamping effect than from a masking effect, and thus yields more false positives than false negatives. We compare the proposed approaches with existing methods using simulated data from various population settings. A real data example is presented, with consideration of the implications of the outliers found.
Pages: 13-29
Number of pages: 17