Bayesian Hyper-LASSO Classification for Feature Selection with Application to Endometrial Cancer RNA-seq Data

Authors
Lai Jiang
Celia M. T. Greenwood
Weixin Yao
Longhai Li
Affiliations
[1] Jewish General Hospital, Lady Davis Institute for Medical Research
[2] McGill University, Department of Epidemiology, Biostatistics and Occupational Health
[3] McGill University, Gerald Bronfman Department of Oncology
[4] University of California, Department of Statistics
[5] University of Saskatchewan, Department of Mathematics and Statistics
Source
Scientific Reports, Volume 10
Abstract
Feature selection is needed in many modern scientific research problems that use high-dimensional data. A typical example is identifying gene signatures related to a certain disease from high-dimensional gene expression data. Gene expression may have grouping structure: for example, a group of co-regulated genes with similar biological functions tends to have similar expression levels. It is therefore preferable to take the grouping structure into account when selecting features. In this paper, we propose a Bayesian Robit regression method with Hyper-LASSO priors (abbreviated BayesHL) for feature selection in high-dimensional genomic data with grouping structure. The main features of BayesHL are that it discards unrelated features more aggressively than LASSO and that it performs feature selection within groups automatically, without a pre-specified grouping structure. We apply BayesHL to gene expression analysis to identify subsets of genes that contribute to the 5-year survival outcome of endometrial cancer (EC) patients. The results show that BayesHL outperforms alternative methods (including LASSO, group LASSO, supervised group LASSO, penalized logistic regression, random forest, neural network, XGBoost and knockoff) in terms of predictive power, sparsity and the ability to uncover grouping structure, and provides insight into the mechanisms of multiple genetic pathways leading to differentiated EC survival outcomes.
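
For orientation, the two ingredients named in the abstract can be written in generic notation. In a robit model, a Student-t cumulative distribution function serves as the (inverse) link in place of the logistic function, and a hyper-LASSO prior is, loosely, a coefficient prior whose tails are heavier than those of the Laplace prior underlying LASSO. The sketch below is only an illustration under stated assumptions: the symbols $\beta$, $\nu$, $\alpha$ and $w$, and the choice of a scaled t prior as the heavy-tailed prior, are not taken from the abstract and may differ from the paper's actual specification.

```latex
% Assumed notation (not from the abstract):
%   T_nu          : CDF of a Student-t distribution with nu degrees of freedom
%   t_alpha(0, w) : t distribution with alpha degrees of freedom, location 0, scale w
\begin{align*}
P(y_i = 1 \mid x_i, \beta) &= T_{\nu}\!\left(x_i^{\top}\beta\right), \\
\beta_j &\sim t_{\alpha}(0, w), \qquad j = 1, \dots, p .
\end{align*}
```

With a small $\alpha$ (for example $\alpha = 1$, the Cauchy case) the prior places most of its mass near zero while keeping very heavy tails, which is one way to read the abstract's claim that unrelated features are discarded more aggressively than under LASSO.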