VARIABLE SELECTION FOR SPARSE DIRICHLET-MULTINOMIAL REGRESSION WITH AN APPLICATION TO MICROBIOME DATA ANALYSIS

Cited by: 141
Authors
Chen, Jun [1]
Li, Hongzhe [1]
Affiliations
[1] Univ Penn, Dept Biostat & Epidemiol, Philadelphia, PA 19104 USA
Keywords
Coordinate descent; counts data; overdispersion; regularized likelihood; sparse group penalty; count data
DOI
10.1214/12-AOAS592
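Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract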
With the development of next-generation sequencing technology, researchers can now study microbiome composition by direct sequencing, which yields bacterial taxa counts for each microbiome sample. One goal of microbiome studies is to associate microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of the observed counts. The DM regression model can be used to test the association between taxa composition and covariates using the likelihood ratio test. However, when the number of covariates is large, multiple testing can lead to loss of power. To address the high dimensionality of the problem, we develop a penalized likelihood approach that estimates the regression parameters and selects variables by imposing a sparse group ℓ1 penalty, which encourages both group-level and within-group sparsity. Such a variable selection procedure can lead to selection of the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that sparse DM regression can identify microbiome-associated covariates better than models that ignore overdispersion or consider only the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results show that nutrient intake is strongly associated with the human gut microbiome.
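The penalized objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the link function, the penalty weights, or the grouping, so the following are assumptions made here for illustration only: a log link for the DM parameters, one group per covariate (its coefficients across all taxa), and the hypothetical names `dm_gammas`, `lam_group`, and `lam_l1`. The sketch only evaluates the objective; it does not implement the block-coordinate descent algorithm.

```python
import math


def dm_log_likelihood(counts, gammas):
    """Log-probability of one count vector under a Dirichlet-multinomial
    distribution with positive parameters `gammas`."""
    n = sum(counts)
    g_plus = sum(gammas)
    ll = math.lgamma(n + 1) + math.lgamma(g_plus) - math.lgamma(n + g_plus)
    for y, g in zip(counts, gammas):
        ll += math.lgamma(y + g) - math.lgamma(g) - math.lgamma(y + 1)
    return ll


def dm_gammas(x, alpha, beta):
    """ASSUMPTION: log link, gamma_j = exp(alpha_j + sum_k beta[k][j] * x_k).
    beta[k][j] is the effect of covariate k on taxon j."""
    return [
        math.exp(alpha[j] + sum(beta[k][j] * x[k] for k in range(len(x))))
        for j in range(len(alpha))
    ]


def sparse_group_penalty(beta, lam_group, lam_l1):
    """Group term (L2 norm of each covariate's coefficient row) plus an
    elementwise L1 term. The weights are hypothetical tuning parameters."""
    group = sum(math.sqrt(sum(b * b for b in row)) for row in beta)
    l1 = sum(abs(b) for row in beta for b in row)
    return lam_group * group + lam_l1 * l1


def penalized_objective(X, Y, alpha, beta, lam_group, lam_l1):
    """Negative DM log-likelihood over all samples plus the penalty.
    Intercepts `alpha` are left unpenalized in this sketch."""
    nll = -sum(
        dm_log_likelihood(y, dm_gammas(x, alpha, beta)) for x, y in zip(X, Y)
    )
    return nll + sparse_group_penalty(beta, lam_group, lam_l1)
```

In this sketch the group term can zero out an entire covariate (all of its taxon effects), while the L1 term can zero out individual covariate-taxon effects within a retained covariate, which corresponds to the group-level and within-group sparsity mentioned in the abstract.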
Pages: 418-442
Page count: 25