Variable selection in regression mixture modeling for the discovery of gene regulatory networks

Cited by: 32
Authors
Gupta, Mayetri [1]
Ibrahim, Joseph G. [1]
Affiliations
[1] Univ N Carolina, Dept Biostat, Chapel Hill, NC 27599 USA
Funding
National Institutes of Health (USA)
Keywords
Bayesian model selection; evolutionary Monte Carlo; hierarchical model; importance sampling; motif discovery; transcription regulation;
DOI
10.1198/016214507000000068
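Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract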
The profusion of genomic data through genome sequencing and gene expression microarray technology has facilitated statistical research in determining gene interactions regulating a biological process. Current methods generally consist of a two-stage procedure: clustering gene expression measurements and searching for regulatory "switches," typically short, conserved sequence patterns (motifs) in the DNA sequence adjacent to the genes. This process often leads to misleading conclusions as incorrect cluster selection may lead to missing important regulatory motifs or making many false discoveries. Treating cluster memberships as known, rather than estimated, introduces bias into analyses and prevents uncertainty about cluster parameters from being accounted for. Further, there is underutilization of the available data, as the sequence information is ignored for purposes of expression clustering and vice versa. We propose a way to address these issues by combining gene clustering and motif discovery in a unified framework, a mixture of hierarchical regression models, with unknown components representing the latent gene clusters, and genomic sequence features linked to the resultant gene expression through a multivariate hierarchical regression. We demonstrate a Monte Carlo method for simultaneous variable selection (for motifs) and clustering (for genes). The selection of the number of components in the mixture is addressed by computing the analytically intractable Bayes factor through a novel multistage mixture importance sampling approach. This methodology is used to analyze a yeast cell cycle dataset to determine an optimal set of motifs that discriminates between groups of genes and simultaneously finds the most significant gene clusters.
Pages: 867-880
Number of pages: 14