Clustering using objective functions and stochastic search

Cited by: 58
Authors
Booth, James G. [1]
Casella, George [2]
Hobert, James P. [2]
Affiliations
[1] Cornell Univ, Dept Biol Stat & Computat Biol, Ithaca, NY 14850 USA
[2] Univ Florida, Gainesville, FL USA
Keywords
Bayesian model; best linear unbiased predictor; cluster analysis; linear mixed model; Markov chain Monte Carlo methods; Metropolis-Hastings algorithm; microarray; quadratic penalized splines; set partition; yeast cell cycle
DOI
10.1111/j.1467-9868.2007.00629.x
Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract
A new approach to clustering multivariate data, based on a multilevel linear mixed model, is proposed. A key feature of the model is that observations from the same cluster are correlated, because they share cluster-specific random effects. The inclusion of cluster-specific random effects allows parsimonious departure from an assumed base model for cluster mean profiles. This departure is captured statistically via the posterior expectation, or best linear unbiased predictor. One of the parameters in the model is the true underlying partition of the data, and the posterior distribution of this parameter, which is known up to a normalizing constant, is used to cluster the data. The problem of finding partitions with high posterior probability is not amenable to deterministic methods such as the EM algorithm. Thus, we propose a stochastic search algorithm that is driven by a Markov chain that is a mixture of two Metropolis-Hastings algorithms: one that makes small-scale changes to individual objects and another that performs large-scale moves involving entire clusters. The methodology proposed is fundamentally different from the well-known finite mixture model approach to clustering, which does not explicitly include the partition as a parameter, and involves an independent and identically distributed structure.
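
The search strategy described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' method: `log_objective` is a toy stand-in (negative within-cluster sum of squares minus a per-cluster penalty) for the log posterior of the partition, which the abstract derives from a linear mixed model; `small_move` and `large_move` are hypothetical proposals chosen because they are symmetric on the label vector, so the acceptance probability reduces to a ratio of objective values; and `p_large`, `penalty`, and `n_iter` are arbitrary tuning values.

```python
import math
import random


def log_objective(data, labels, penalty=1.0):
    """Toy log objective (an assumption, not the paper's posterior):
    negative within-cluster sum of squares minus a penalty per cluster."""
    clusters = {}
    for x, g in zip(data, labels):
        clusters.setdefault(g, []).append(x)
    wss = 0.0
    for members in clusters.values():
        mean = sum(members) / len(members)
        wss += sum((x - mean) ** 2 for x in members)
    return -wss - penalty * len(clusters)


def small_move(labels, rng):
    """Small-scale proposal: give one random object a random label."""
    proposal = list(labels)
    i = rng.randrange(len(labels))
    proposal[i] = rng.randrange(len(labels))
    return proposal


def large_move(labels, rng):
    """Large-scale proposal: pick two labels and redistribute all of
    their members between them by fair coin flips (can split or merge)."""
    a, b = rng.sample(range(len(labels)), 2)
    return [rng.choice((a, b)) if g in (a, b) else g for g in labels]


def stochastic_search(data, n_iter=5000, p_large=0.1, seed=0):
    """Run a mixture of the two moves and return the best labels seen."""
    rng = random.Random(seed)
    labels = [0] * len(data)  # start with every object in one cluster
    current = log_objective(data, labels)
    best_labels, best = list(labels), current
    for _ in range(n_iter):
        move = large_move if rng.random() < p_large else small_move
        proposal = move(labels, rng)
        candidate = log_objective(data, proposal)
        # Both proposals are symmetric, so accept with probability
        # min(1, exp(candidate - current)).
        if candidate >= current or rng.random() < math.exp(candidate - current):
            labels, current = proposal, candidate
            if current > best:
                best_labels, best = list(labels), current
    return best_labels, best


if __name__ == "__main__":
    toy_data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
    print(stochastic_search(toy_data))
```

Because the partition itself is the state of the chain, the sketch keeps track of the highest-scoring labelling visited instead of averaging over samples, which matches the stated goal of finding partitions with high posterior probability.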
Pages: 119-139
Page count: 21