High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking

Cited by: 0
Authors
Fan Wang
Sach Mukherjee
Sylvia Richardson
Steven M. Hill
Affiliations
[1] University of Cambridge, MRC Biostatistics Unit
[2] German Centre for Neurodegenerative Diseases (DZNE)
Source
Statistics and Computing | 2020 / Volume 30
Keywords
Simulation study; High-dimensional regression; Penalized regression; Lasso; Variable selection; Prediction
DOI
Not available
Abstract
Penalized likelihood approaches are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well developed, the relative efficacy of different approaches in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users. In this paper, we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 2300 data-generating scenarios, including both synthetic and semisynthetic data (real covariates and simulated responses), allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely used approaches (Lasso, Adaptive Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection). We find considerable variation in performance between methods. Our results support a “no panacea” view, with no unambiguous winner across all scenarios or goals, even in this restricted setting where all data align well with the assumptions underlying the methods. The study allows us to make some recommendations as to which approaches may be most (or least) suitable given the goal and some data characteristics. Our empirical results complement existing theory and provide a resource to compare methods across a range of scenarios and metrics.
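To make the distinction between the three goals concrete, the following is a minimal sketch of a single synthetic scenario, not the authors' code or study design. It assumes independent Gaussian covariates (no multicollinearity), no intercept, and a plain coordinate-descent Lasso with a fixed penalty; the function names (`soft_threshold`, `lasso_cd`, `draw`, `evaluate`) and all parameter values are hypothetical choices made for illustration.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - X beta
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes variable j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def draw(n, beta_true, rng):
    """Draw n observations from a linear model with unit-variance Gaussian noise."""
    p = len(beta_true)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(b * v for b, v in zip(beta_true, x)) + rng.gauss(0.0, 1.0) for x in X]
    return X, y


def evaluate(n=60, p=30, s=3, signal=2.0, lam=0.3, seed=0):
    """Score one scenario on prediction, variable selection and variable ranking."""
    rng = random.Random(seed)
    beta_true = [signal] * s + [0.0] * (p - s)  # first s variables are active
    X, y = draw(n, beta_true, rng)
    X_test, y_test = draw(n, beta_true, rng)
    beta_hat = lasso_cd(X, y, lam)

    # Prediction: mean squared error on independent test data.
    test_mse = sum(
        (yt - sum(b * v for b, v in zip(beta_hat, x))) ** 2
        for x, yt in zip(X_test, y_test)
    ) / n

    # Selection: compare the estimated support with the true support.
    truth = set(range(s))
    selected = {j for j, b in enumerate(beta_hat) if b != 0.0}

    # Ranking: order variables by |coefficient| and count true variables in the top s.
    ranking = sorted(range(p), key=lambda j: -abs(beta_hat[j]))

    return {
        "test_mse": test_mse,
        "true_pos": len(selected & truth),
        "false_pos": len(selected - truth),
        "top_s_hits": len(set(ranking[:s]) & truth),
    }


if __name__ == "__main__":
    print(evaluate())
```

Varying `n`, `p`, `s`, `signal` and the covariate correlation over a grid, and swapping in other estimators, would give a toy version of the kind of comparison the abstract describes; the fixed `lam` here stands in for whatever tuning procedure a real study would use.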
Pages: 697-719
Number of pages: 22