Bayesian variable selection in clustering high-dimensional data via a mixture of finite mixtures

Cited by: 3

Authors:

Doo, Woojin [1]

Kim, Heeyoung [1]

Affiliations:

[1] Korea Adv Inst Sci & Technol KAIST, Dept Ind & Syst EngnR, Daejeon, South Korea

Source:

JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION | 2021 / Vol. 91 / No. 12

Funding:

National Research Foundation of Singapore;

Keywords:

Bayesian inference; clustering; DNA microarray data; finite mixture model; high-dimensional data; variable selection;

DOI:

10.1080/00949655.2021.1902526

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

When clustering high-dimensional data, it is often important to identify variables that discriminate the clusters. Meanwhile, a common issue in clustering is to determine the number of clusters. In this study, we propose a new method that simultaneously performs clustering and variable selection, while inferring the number of clusters from the data. We formulate the clustering problem using a finite mixture model with a symmetric Dirichlet weights prior, while also placing a prior on the number of components. That is, we utilize a mixture of finite mixtures. We handle the variable selection problem by introducing a latent binary vector, which represents the inclusion/exclusion of variables. We update the binary vector for variable selection using a Metropolis algorithm and perform inference on the cluster structure using a split-merge Markov chain Monte Carlo technique. We demonstrate the advantage of our method using simulated and two real DNA microarray datasets.
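The variable selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simplified Gaussian model with known unit variance, a conjugate normal prior on each mean (so means can be integrated out), an independent Bernoulli prior on each entry of the latent binary vector, and a single-coordinate flip proposal. The cluster labels `z` are held fixed here, whereas the paper updates them with a split-merge sampler. The names `log_marginal`, `log_post_contrib`, `metropolis_flip`, `omega`, `sigma2`, and `tau2` are hypothetical.

```python
import numpy as np

def log_marginal(x, sigma2=1.0, tau2=1.0):
    """Log marginal likelihood of x_i ~ N(mu, sigma2) with mu ~ N(0, tau2) integrated out."""
    n = x.size
    s, ss = x.sum(), np.square(x).sum()
    # Quadratic form x' (sigma2*I + tau2*11')^{-1} x via the Sherman-Morrison identity.
    quad = (ss - tau2 * s**2 / (sigma2 + n * tau2)) / sigma2
    return -0.5 * (n * np.log(2 * np.pi * sigma2) + np.log1p(n * tau2 / sigma2) + quad)

def log_post_contrib(X, z, j, included, omega=0.1):
    """Log posterior contribution of variable j for a given inclusion state (assumed model)."""
    col = X[:, j]
    if included:
        # Selected variable: a separate mean for each non-empty cluster.
        ll = sum(log_marginal(col[z == k]) for k in np.unique(z))
        return ll + np.log(omega)
    # Unselected variable: one mean shared by all observations.
    return log_marginal(col) + np.log1p(-omega)

def metropolis_flip(X, z, gamma, rng, omega=0.1):
    """One Metropolis step: propose flipping a uniformly chosen entry of gamma."""
    j = rng.integers(X.shape[1])
    current = bool(gamma[j])
    log_ratio = (log_post_contrib(X, z, j, not current, omega)
                 - log_post_contrib(X, z, j, current, omega))
    # The proposal is symmetric, so the acceptance ratio is the posterior ratio.
    if np.log(rng.random()) < log_ratio:
        gamma = gamma.copy()
        gamma[j] = not current
    return gamma
```

Calling `metropolis_flip` repeatedly with a boolean `gamma` of length equal to the number of variables yields a chain over inclusion vectors. In a full sampler this step would alternate with updates of the cluster labels and the number of clusters.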

Citation

Pages: 2551-2568

Page count: 18

References: 24 in total

[1] Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays [J].
Alon, U; Barkai, N; Notterman, DA; Gish, K; Ybarra, S; Mack, D; Levine, AJ.
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1999, 96 (12): 6745-6750

[2] Archetti F, WORLD SCI, P49

[3] Ferguson distributions via Polya urn schemes [J].
Blackwell, D; MacQueen, JB.
ANNALS OF STATISTICS, 1973, 1 (02): 353-355

[4] Variable selection in model-based clustering and discriminant analysis with a regularization approach [J].
Celeux, Gilles; Maugis-Rabusseau, Cathy; Sedki, Mohammed.
ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2019, 13 (01): 259-278

[5] Comparison of discrimination methods for the classification of tumors using gene expression data [J].
Dudoit, S; Fridlyand, J; Speed, TP.
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2002, 97 (457): 77-87

[6] Clustering objects on subsets of attributes [J].
Friedman, JH; Meulman, JJ.
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 2004, 66: 815-839

[7] Probabilistic Community Detection With Unknown Number of Communities [J].
Geng, Junxian; Bhattacharya, Anirban; Pati, Debdeep.
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2019, 114 (526): 893-905

[8] Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring [J].
Golub, TR; Slonim, DK; Tamayo, P; Huard, C; Gaasenbeek, M; Mesirov, JP; Coller, H; Loh, ML; Downing, JR; Caligiuri, MA; Bloomfield, CD; Lander, ES.
SCIENCE, 1999, 286 (5439): 531-537

[9] Model-based subspace clustering [J].
Hoff, Peter D.
BAYESIAN ANALYSIS, 2006, 1 (02): 321-344

[10] Gene extraction for cancer diagnosis by support vector machines - An improvement [J].
Huang, TM; Kecman, V.
ARTIFICIAL INTELLIGENCE IN MEDICINE, 2005, 35 (1-2): 185-194
