In Defense of the Indefensible: A Very Naive Approach to High-Dimensional Inference

Cited: 0
Authors
Zhao, Sen [1 ]
Witten, Daniela [2 ]
Shojaie, Ali [3 ]
Affiliations
[1] Google Res, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 USA
[2] Univ Washington, Stat & Biostat, Hlth Sci Bldg,Box 357232, Seattle, WA 98195 USA
[3] Univ Washington, Biostat, Hlth Sci Bldg,Box 357232, Seattle, WA 98195 USA
Funding
US National Institutes of Health (NIH)
Keywords
Confidence interval; lasso; p-value; post-selection inference; significance testing; model selection; variable selection; regression; confidence regions; estimators; shrinkage
DOI
10.1214/20-STS815
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject Classification Codes
020208 ; 070103 ; 0714 ;
Abstract
A great deal of interest has recently focused on conducting inference on the parameters in a high-dimensional linear model. In this paper, we consider a simple and very naive two-step procedure for this task, in which we (i) fit a lasso model in order to obtain a subset of the variables, and (ii) fit a least squares model on the lasso-selected set. Conventional statistical wisdom tells us that we cannot make use of the standard statistical inference tools for the resulting least squares model (such as confidence intervals and p-values), since we peeked at the data twice: once in running the lasso, and again in fitting the least squares model. However, in this paper, we show that under a certain set of assumptions, with high probability, the set of variables selected by the lasso is identical to the one selected by the noiseless lasso and is hence deterministic. Consequently, the naive two-step approach can yield asymptotically valid inference. We utilize this finding to develop the naive confidence interval, which can be used to draw inference on the regression coefficients of the model selected by the lasso, as well as the naive score test, which can be used to test the hypotheses regarding the full-model regression coefficients.
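The two-step procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the simulated data, the fixed lasso penalty `alpha=0.1`, and the use of scikit-learn are all assumptions made for the example. Step (i) runs the lasso to select a set of variables; step (ii) refits least squares on that set and forms the "naive" confidence intervals, treating the selected set as if it were fixed in advance.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

# Simulated sparse high-dimensional-style data (assumed setup for illustration)
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]          # three true signal variables
y = X @ beta + rng.standard_normal(n)

# Step (i): fit the lasso and take the support as the selected set
sel = np.flatnonzero(Lasso(alpha=0.1).fit(X, y).coef_)  # alpha is a hypothetical choice

# Step (ii): ordinary least squares on the lasso-selected columns
Xs = X[:, sel]
coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
resid = y - Xs @ coef
df = n - len(sel)
sigma2 = resid @ resid / df                              # residual variance estimate
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xs.T @ Xs))) # classical OLS standard errors

# "Naive" 95% confidence intervals: standard t-intervals that ignore selection
t = stats.t.ppf(0.975, df)
ci = np.column_stack([coef - t * se, coef + t * se])
```

The paper's point is that, under suitable assumptions, the selected set `sel` coincides with the (deterministic) noiseless-lasso support with high probability, so intervals like `ci` can be asymptotically valid despite the data being used twice.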
Pages: 562-577 (16 pages)