Outlier Identification in Model-Based Cluster Analysis

Cited by: 16
Authors
Evans, Katie [1, 3]
Love, Tanzy [2]
Thurston, Sally W. [2]
Affiliations
[1] DuPont Co Inc, DuET Appl Stat, Wilmington, DE USA
[2] Univ Rochester, Dept Biostat & Computat Biol, Rochester, NY 14627 USA
[3] Univ Rochester, Rochester, NY 14627 USA
Keywords
Normal-mixture models; Influential points; MCLUST; Prior; National Hockey League
DOI
10.1007/s00357-015-9171-5
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
In model-based clustering based on normal-mixture models, a few outlying observations can influence the cluster structure and the number of clusters. This paper develops a method to identify such observations; however, it does not attempt to identify clusters amid a large field of noisy observations. We identify outliers as those observations in a cluster with minimal membership proportion, or those for which the cluster-specific variance with and without the observation is very different. Results from a simulation study demonstrate that our method detects true outliers without falsely identifying many non-outliers, and that it outperforms other approaches under most scenarios. We use the contributed R package MCLUST for model-based clustering, but propose a modified prior for the cluster-specific variance that avoids degeneracies in estimation procedures. We also compare results from our outlier method to published results on National Hockey League data.
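The two criteria named in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes one-dimensional data with hard cluster labels already assigned (the paper works with normal-mixture fits from MCLUST), and the function name `flag_outliers` and the thresholds `min_prop` and `ratio_cutoff` are hypothetical choices made only for illustration.

```python
from collections import defaultdict
from statistics import variance


def flag_outliers(values, labels, min_prop=0.1, ratio_cutoff=3.0):
    """Return sorted indices of points flagged by either illustrative rule.

    min_prop and ratio_cutoff are hypothetical thresholds, not values
    taken from the paper.
    """
    n = len(values)
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)

    flagged = set()
    for members in clusters.values():
        # Rule 1: the cluster holds a very small share of the data.
        if len(members) / n < min_prop:
            flagged.update(members)
            continue
        # Rule 2 needs at least 3 points so the leave-one-out variance exists.
        if len(members) < 3:
            continue
        cluster_vals = [values[i] for i in members]
        full_var = variance(cluster_vals)
        for pos, idx in enumerate(members):
            rest = cluster_vals[:pos] + cluster_vals[pos + 1:]
            loo_var = variance(rest)
            # Flag the point if dropping it shrinks the cluster variance sharply.
            if loo_var == 0:
                if full_var > 0:
                    flagged.add(idx)
            elif full_var / loo_var > ratio_cutoff:
                flagged.add(idx)
    return sorted(flagged)


# Toy data: 5.0 sits far from its cluster, and 30.0 forms a cluster of its own.
values = [1.0, 1.1, 0.9, 1.2, 0.8, 5.0, 10.0, 10.1, 9.9, 10.2, 9.8, 30.0]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2]
print(flag_outliers(values, labels))  # prints [5, 11]
```

In this toy example, index 11 is flagged because its cluster holds under 10% of the observations, and index 5 is flagged because removing it reduces the sample variance of its cluster by far more than the cutoff factor.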
Pages: 63-84
Number of pages: 22