Model-Based Clustering with Measurement or Estimation Errors

Cited by: 9

Authors:

Zhang, Wanli [1,2]

Di, Yanming [1]

Affiliations:

[1] Oregon State Univ, Dept Stat, Corvallis, OR 97330 USA

[2] Eli Lilly & Co, Shanghai 200021, Peoples R China

Source:

GENES | 2020 / Vol. 11 / Issue 02

Funding:

National Institutes of Health (USA);

Keywords:

gaussian finite mixture model; clustering analysis; uncertainty; expectation-maximization algorithm; classification boundary; gene expression; RNA-seq; ALGORITHM;

DOI:

10.3390/genes11020185

Chinese Library Classification (CLC) number:

Q3 [Genetics];

Discipline classification code:

071007 ; 090102 ;

Abstract:

Model-based clustering with finite mixture models has become a widely used clustering method. One of the recent implementations is MCLUST. When objects to be clustered are summary statistics, such as regression coefficient estimates, they are naturally associated with estimation errors, whose covariance matrices can often be calculated exactly or approximated using asymptotic theory. This article proposes an extension to Gaussian finite mixture modeling, called MCLUST-ME, that properly accounts for the estimation errors. More specifically, we assume that the distribution of each observation consists of an underlying true component distribution and an independent measurement error distribution. Under this assumption, each unique value of estimation error covariance corresponds to its own classification boundary, which consequently results in a different grouping from MCLUST. Through simulation and application to an RNA-Seq data set, we discovered that, under certain circumstances, explicitly modeling estimation errors improves clustering performance or provides new insights into the data compared with simply ignoring the errors, and that the degree of improvement depends on factors such as the distribution of error covariance matrices.
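The dependence of the classification boundary on the error covariance can be illustrated with a small sketch. It is not the MCLUST-ME implementation. It assumes that the measurement error is Gaussian with a known covariance, so that observation i under component k has density N(mu_k, Sigma_k + S_i). The function names and all parameter values below are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal N(mean, cov) evaluated at x."""
    d = len(mean)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)   # Mahalanobis term
    log_det = np.linalg.slogdet(cov)[1]        # log |cov|
    return np.exp(-0.5 * (quad + log_det + d * np.log(2 * np.pi)))

def membership_probs(x, err_cov, weights, means, covs):
    """Posterior component probabilities for one observation x.

    Assumption: the observation is a true value drawn from component k plus
    independent Gaussian error with known covariance err_cov, so its marginal
    covariance under component k is covs[k] + err_cov.
    """
    dens = np.array([
        w * gaussian_pdf(x, m, c + err_cov)
        for w, m, c in zip(weights, means, covs)
    ])
    return dens / dens.sum()

# Hypothetical two-component mixture in two dimensions:
# one tight component and one diffuse component.
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
covs = [np.eye(2) * 0.25, np.eye(2) * 4.0]

x = np.array([1.5, 0.0])

# Same point, two different (assumed known) error covariances.
p_small = membership_probs(x, np.eye(2) * 0.01, weights, means, covs)
p_large = membership_probs(x, np.eye(2) * 9.0, weights, means, covs)

print("small error:", p_small, "-> component", p_small.argmax())
print("large error:", p_large, "-> component", p_large.argmax())
```

In this toy setup the same point is assigned to different components depending on its error covariance, which is the sense in which each error covariance defines its own boundary. A full fit would also need to estimate the weights, means, and component covariances, which this sketch does not attempt.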

Pages: 23

References (25 in total):

[1] [Anonymous], 2019, R foundation for statistical computing
[2] MODEL-BASED GAUSSIAN AND NON-GAUSSIAN CLUSTERING
BANFIELD, JD
RAFTERY, AE
[J]. BIOMETRICS, 1993, 49 (03) : 803 - 821
[3] Bouveyron C, 2019, CA ST PR MA, V50, P1, DOI 10.1017/9781108644181
[4] A LIMITED MEMORY ALGORITHM FOR BOUND CONSTRAINED OPTIMIZATION
BYRD, RH
LU, PH
NOCEDAL, J
ZHU, CY
[J]. SIAM JOURNAL ON SCIENTIFIC COMPUTING, 1995, 16 (05) : 1190 - 1208
[5] GAUSSIAN PARSIMONIOUS CLUSTERING MODELS
CELEUX, G
GOVAERT, G
[J]. PATTERN RECOGNITION, 1995, 28 (05) : 781 - 793
[6] Detecting features in spatial point processes with clutter via model-based clustering
Dasgupta, A
Raftery, AE
[J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1998, 93 (441) : 294 - 302
[7] MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM
DEMPSTER, AP
LAIRD, NM
RUBIN, DB
[J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) : 1 - 38
[8] Single-gene negative binomial regression models for RNA-Seq data with higher-order asymptotic inference
Di, Yanming
[J]. STATISTICS AND ITS INTERFACE, 2015, 8 (04) : 405 - 418
[9] SOME APPLICATIONS OF MATRIX DERIVATIVES IN MULTIVARIATE ANALYSIS
DWYER, PS
[J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1967, 62 (318) : 607 - &
[10] Model-based clustering, discriminant analysis, and density estimation
Fraley, C
Raftery, AE
[J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2002, 97 (458) : 611 - 631
