High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking

Cited by: 20
Authors
Wang, Fan [1]
Mukherjee, Sach [2]
Richardson, Sylvia [1]
Hill, Steven M. [1]
Affiliations
[1] Univ Cambridge, MRC BioStat Unit, Cambridge, England
[2] German Centre Neurodegenerat Dis, DZNE, Bonn, Germany
Keywords
Simulation study; High-dimensional regression; Penalized regression; Lasso; Variable selection; Prediction; NONCONCAVE PENALIZED LIKELIHOOD; DANTZIG SELECTOR; STATISTICAL ESTIMATION; MODEL SELECTION; LASSO; REGULARIZATION; SPARSITY; LARGER
DOI
10.1007/s11222-019-09914-9
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Penalized likelihood approaches are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well developed, the relative efficacy of different approaches in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users. In this paper, we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 2300 data-generating scenarios, including both synthetic and semisynthetic data (real covariates and simulated responses), allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely used approaches (Lasso, Adaptive Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection). We find considerable variation in performance between methods. Our results support a "no panacea" view, with no unambiguous winner across all scenarios or goals, even in this restricted setting where all data align well with the assumptions underlying the methods. The study allows us to make some recommendations as to which approaches may be most (or least) suitable given the goal and some data characteristics. Our empirical results complement existing theory and provide a resource to compare methods across a range of scenarios and metrics.
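The three goals named in the abstract (prediction, variable selection and variable ranking) can be illustrated on a single synthetic scenario. The sketch below is not the authors' code or experimental design: it assumes an i.i.d. standard Gaussian design (so no multicollinearity), unit noise variance, a fixed penalty `lam` instead of a tuned one, and a plain cyclic coordinate-descent Lasso written with the Python standard library only. All function names, parameter values and metric definitions (`simulate`, `lasso_cd`, `evaluate`, `run_scenario`, `top_s_recall`) are hypothetical and chosen for illustration.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def simulate(n, p, s, signal, rng):
    """Assumed scenario: N(0,1) covariates, first s coefficients equal to `signal`."""
    beta = [signal] * s + [0.0] * (p - s)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [sum(x[j] * beta[j] for j in range(s)) + rng.gauss(0, 1) for x in X]
    return X, y, beta


def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - Xb; equals y because b starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes b[j]'s own fit.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * b[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                b[j] = new
    return b


def evaluate(b_hat, beta, X_test, y_test):
    """One illustrative metric per goal: prediction, selection, ranking."""
    p = len(beta)
    truth = {j for j in range(p) if beta[j] != 0.0}
    preds = [sum(x[j] * b_hat[j] for j in range(p)) for x in X_test]
    rmse = (sum((t - q) ** 2 for t, q in zip(y_test, preds)) / len(y_test)) ** 0.5
    selected = {j for j in range(p) if b_hat[j] != 0.0}
    # Ranking: order variables by |coefficient| and check the top s positions.
    top = sorted(range(p), key=lambda j: -abs(b_hat[j]))[: len(truth)]
    return {
        "rmse": rmse,                                   # prediction
        "tpr": len(selected & truth) / len(truth),      # selection
        "false_pos": len(selected - truth),             # selection
        "top_s_recall": len(set(top) & truth) / len(truth),  # ranking
    }


def run_scenario(n=50, p=100, s=5, signal=1.0, lam=0.2, seed=0):
    """Fit on n training samples and score on an independent test set."""
    rng = random.Random(seed)
    X, y, beta = simulate(n, p, s, signal, rng)
    X_test, y_test, _ = simulate(200, p, s, signal, rng)
    return evaluate(lasso_cd(X, y, lam), beta, X_test, y_test)


print(run_scenario())
```

Varying `n`, `p`, `s` and `signal` in `run_scenario` mimics, in a very reduced form, the kind of factor grid the abstract describes; a faithful comparison would also need correlated designs, data-driven tuning of the penalty and the other methods listed.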
Pages: 697-719
Page count: 23