Variable selection techniques after multiple imputation in high-dimensional data

Cited by: 5
Authors
Zahid, Faisal Maqbool [1 ]
Faisal, Shahla [1 ]
Heumann, Christian [2 ]
Affiliations
[1] Govt Coll Univ Faisalabad, Dept Stat, Faisalabad, Pakistan
[2] Ludwig Maximilians Univ Munchen, Dept Stat, Munich, Germany
Keywords
High-dimensional data; Multiple imputation; LASSO; Rubin's rules; Variable selection; MISSING DATA; MODEL SELECTION; REGULARIZATION; LIKELIHOOD; REGRESSION; INFERENCE;
DOI
10.1007/s10260-019-00493-7
CLC Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject Classification Codes
020208; 070103; 0714
Abstract
High-dimensional data arise in diverse fields of scientific research, and missing values are frequently encountered in such data. Variable selection plays a key role in high-dimensional data analysis, yet, like many other statistical techniques, it requires complete cases without missing values. A variety of variable selection techniques is available for complete data, but comparable techniques for data with missing values are scarce in the literature. Multiple imputation is a popular approach for handling missing values and obtaining completed datasets. If a particular variable selection technique is applied independently to each of the multiply imputed datasets, the result may be a different model for each dataset, and it remains unclear how variable selection should be carried out on multiply imputed data. In this paper, we propose to use the magnitude of the parameter estimates of each candidate predictor across all imputed datasets for its selection: a constraint is imposed on the sum of the absolute values of these estimates to select or remove the predictor from the model. The proposed method for identifying informative predictors is compared with other approaches in an extensive simulation study. Performance is compared on the basis of hit rates (the proportion of correctly identified informative predictors) and false alarm rates (the proportion of non-informative predictors incorrectly flagged as informative) for different numbers of imputed datasets. The proposed technique is simple and easy to implement, performs as well in high-dimensional settings as in low-dimensional ones, and is observed to be a good competitor to existing approaches across different simulation settings. The performance of the different variable selection techniques is also examined on a real dataset with missing values.
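As a rough illustration of the selection rule described in the abstract, the joint criterion can be sketched in the following form; the notation (M imputed datasets, coefficients \beta_j^{(m)} for predictor j in imputation m, per-imputation loss \ell_m, and tuning parameter \lambda) is introduced here for exposition only and is not quoted from the paper:

\[
\min_{\beta^{(1)},\ldots,\beta^{(M)}}\;
\sum_{m=1}^{M} \ell_m\!\bigl(\beta^{(m)}\bigr)
\;+\; \lambda \sum_{j=1}^{p} \sum_{m=1}^{M} \bigl|\beta_j^{(m)}\bigr|,
\]

so that, under this reading, a candidate predictor j is retained only when its summed absolute estimate \(\sum_{m=1}^{M}\bigl|\hat{\beta}_j^{(m)}\bigr|\) remains above zero, and is removed from the model when the constraint shrinks all M of its coefficients to zero.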
Pages: 553-580
Number of pages: 28