Bayesian variable selection in clustering high-dimensional data via a mixture of finite mixtures

Cited by: 3
Authors
Doo, Woojin [1]
Kim, Heeyoung [1]
Affiliations
[1] Korea Advanced Institute of Science and Technology (KAIST), Department of Industrial and Systems Engineering, Daejeon, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Bayesian inference; clustering; DNA microarray data; finite mixture model; high-dimensional data; variable selection;
DOI
10.1080/00949655.2021.1902526
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
When clustering high-dimensional data, it is often important to identify the variables that discriminate between the clusters. A second common difficulty in clustering is determining the number of clusters. In this study, we propose a new method that simultaneously performs clustering and variable selection while inferring the number of clusters from the data. We formulate the clustering problem using a finite mixture model with a symmetric Dirichlet prior on the mixture weights, and we also place a prior on the number of components; that is, we use a mixture of finite mixtures. We handle the variable selection problem by introducing a latent binary vector whose entries indicate whether each variable is included or excluded. We update this binary vector using a Metropolis algorithm and perform inference on the cluster structure using a split-merge Markov chain Monte Carlo technique. We demonstrate the advantage of our method using simulated data and two real DNA microarray datasets.
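
As a rough illustration of the Metropolis update of a latent binary inclusion vector mentioned in the abstract, the following Python sketch may help. It is not the authors' algorithm: the cluster labels `z` are held fixed (the paper updates them with split-merge MCMC and infers the number of clusters), and the likelihood is an assumed toy model with unit-variance Gaussian variables and a conjugate N(0, tau2) prior on each mean. The names `log_marginal`, `inclusion_log_odds`, `metropolis_gamma`, and the parameters `w` (prior inclusion probability) and `tau2` are hypothetical choices made for this example.

```python
import numpy as np


def log_marginal(x, tau2=1.0):
    """Per-column log marginal likelihood of x_i ~ N(mu, 1), mu ~ N(0, tau2).

    Assumed toy model: unit noise variance, mean integrated out analytically.
    """
    n = x.shape[0]
    s = x.sum(axis=0)
    ss = (x ** 2).sum(axis=0)
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.log1p(n * tau2)
            - 0.5 * (ss - tau2 * s ** 2 / (1.0 + n * tau2)))


def inclusion_log_odds(X, z, w=0.1, tau2=1.0):
    """Log posterior odds of gamma_j = 1 versus gamma_j = 0 for each variable.

    gamma_j = 1: each cluster has its own mean; gamma_j = 0: one pooled mean.
    """
    clustered = sum(log_marginal(X[z == k], tau2) for k in np.unique(z))
    pooled = log_marginal(X, tau2)
    return clustered - pooled + np.log(w / (1.0 - w))


def metropolis_gamma(X, z, n_iter=2000, w=0.1, tau2=1.0, seed=0):
    """Metropolis sampler over the binary vector gamma with z held fixed.

    Returns the inclusion frequency of each variable after burn-in.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    score = inclusion_log_odds(X, z, w, tau2)
    gamma = np.zeros(p, dtype=int)
    kept = []
    for t in range(n_iter):
        j = rng.integers(p)  # symmetric proposal: flip one random entry
        delta = score[j] if gamma[j] == 0 else -score[j]
        if np.log(rng.random()) < delta:  # accept with prob min(1, e^delta)
            gamma[j] = 1 - gamma[j]
        if t >= n_iter // 2:  # discard the first half as burn-in
            kept.append(gamma.copy())
    return np.mean(kept, axis=0)


# Simulated data: 2 clusters, 10 variables, only the first 3 discriminate.
data_rng = np.random.default_rng(1)
n, p = 100, 10
z = np.repeat([0, 1], n // 2)
X = data_rng.normal(size=(n, p))
X[:, :3] += np.where(z[:, None] == 0, -2.0, 2.0)

freq = metropolis_gamma(X, z)
print(np.round(freq, 2))
```

Because the toy posterior factorizes over variables, flipping entry `j` changes the log posterior by plus or minus `score[j]`, so each step only needs that one number. In a full sampler along the lines the abstract describes, this update would alternate with updates of the cluster assignments, and the scores would have to be recomputed whenever the assignments change.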
Pages: 2551-2568
Page count: 18