Improved two-stage model averaging for high-dimensional linear regression, with application to Riboflavin data analysis

Cited by: 2
Author
Pan, Juming [1 ]
Affiliation
[1] Rowan Univ, Dept Math, Glassboro, NJ 08028 USA
Keywords
High-dimensional regression; Model averaging; Variable selection; Cross-validation; Jackknife
DOI
10.1186/s12859-021-04053-3
Chinese Library Classification
Q5 [Biochemistry]
Discipline codes
071010; 081704
Abstract
Background: Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By suitably weighting several competing statistical models, model averaging attempts to achieve stable and improved prediction. In this paper, we develop a two-stage model averaging procedure to enhance the accuracy and stability of prediction for high-dimensional linear regression. First, we employ a high-dimensional variable selection method, such as the LASSO, to screen out redundant predictors and construct a class of candidate models; then we apply jackknife cross-validation to optimize the model weights for averaging.
Results: In simulation studies, the proposed technique outperforms commonly used alternative methods under high-dimensional regression settings in terms of minimizing the mean squared prediction error. We apply the proposed method to a riboflavin dataset; the results show that the method is efficient in forecasting the riboflavin production rate when there are thousands of genes and only tens of subjects.
Conclusions: Compared with a recent high-dimensional model averaging procedure (Ando and Li in J Am Stat Assoc 109:254-65, 2014), the proposed approach enjoys three appealing features and thus has better predictive performance: (1) more suitable methods are applied for model construction and weighting; (2) computational flexibility is retained, since each candidate model and its corresponding weight are determined in a low-dimensional setting and quadratic programming is used in the cross-validation; (3) model selection and averaging are combined in one procedure, which makes full use of the strengths of both techniques. As a consequence, the proposed method achieves stable and accurate predictions in high-dimensional linear models and can greatly help practical researchers analyze genetic data in medical research.
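The two-stage idea described in the abstract can be illustrated with a deliberately tiny toy sketch (a hypothetical illustration, not the paper's implementation): here, two fixed single-predictor candidate models stand in for the LASSO-screened candidate set, and with only two candidates the jackknife quadratic program for the weights reduces to a closed-form solution projected onto [0, 1].

```python
# Toy two-stage model averaging sketch (illustrative only; the paper uses
# LASSO screening and quadratic programming over many candidate models).
import random

random.seed(0)
n = 50
# Synthetic data: the response depends on both predictors, plus noise.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [1.5 * a + 0.5 * b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

def loo_predictions(x, y):
    """Leave-one-out (jackknife) predictions from a no-intercept,
    single-predictor least-squares fit."""
    sxx = sum(v * v for v in x)
    sxy = sum(v * w for v, w in zip(x, y))
    preds = []
    for xi, yi in zip(x, y):
        beta = (sxy - xi * yi) / (sxx - xi * xi)  # fit without observation i
        preds.append(beta * xi)
    return preds

# Stage 1 (stand-in for LASSO screening): two low-dimensional candidate models.
p1 = loo_predictions(x1, y)
p2 = loo_predictions(x2, y)

# Stage 2: pick the averaging weight w in [0, 1] minimizing the jackknife
# criterion sum_i (y_i - w*p1_i - (1-w)*p2_i)^2; with two models this
# quadratic program has a closed form.
d = [a - b for a, b in zip(p1, p2)]
r = [yi - b for yi, b in zip(y, p2)]
w = sum(ri * di for ri, di in zip(r, d)) / sum(di * di for di in d)
w = min(1.0, max(0.0, w))  # project onto the weight simplex

cv = sum((yi - (w * a + (1 - w) * b)) ** 2 for yi, a, b in zip(y, p1, p2))
print("weight on model 1:", round(w, 3), " jackknife CV:", round(cv / n, 3))
```

By construction, the averaged predictor's jackknife criterion is no worse than that of either candidate model alone, which is the basic appeal of weighting over selecting a single model.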
Pages: 17
References (26 total)
[1] Akaike H. Biometrika, 1979, 66(2): 237. DOI: 10.1093/biomet/66.2.237
[2] Ando T, Li K-C. A weight-relaxed model averaging approach for high-dimensional generalized linear models. Annals of Statistics, 2017, 45(6): 2654-2679.
[3] Ando T, Li K-C. A model-averaging approach for high-dimensional regression. Journal of the American Statistical Association, 2014, 109(505): 254-265.
[4] Breiman L. Random forests. Machine Learning, 2001, 45(1): 5-32.
[5] Buckland ST, Burnham KP, Augustin NH. Model selection: an integral part of inference. Biometrics, 1997, 53(2): 603-618.
[6] Buehlmann P, Mandozzi J. High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Computational Statistics, 2014, 29(3-4): 407-430.
[7] Cule E. 2012. arXiv:1205.0686v1.
[8] Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society Series B (Statistical Methodology), 2008, 70: 849-883.
[9] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 2001, 96(456): 1348-1360.
[10] Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recognition Letters, 2010, 31(14): 2225-2236.