A comparison of model selection methods for prediction in the presence of multiply imputed data

Cited by: 31
Authors
Le Thi Phuong Thao [1]
Geskus, Ronald [1,2]
Affiliations
[1] Univ Oxford, Biostat Grp, Clin Res Unit, Ho Chi Minh City, Vietnam
[2] Univ Oxford, Nuffield Dept Med, Oxford, England
Funding
Wellcome Trust, UK
Keywords
lasso; multiply imputed data; prediction; stacked data; variable selection
DOI
10.1002/bimj.201700232
CLC Classification
Q [Biological Sciences]
Subject Classification
07; 0710; 09
Abstract
Many approaches for variable selection with multiply imputed (MI) data in the development of a prognostic model have been proposed, yet no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods in the presence of MI data: (I) model selection on bootstrap data, using backward elimination based on the AIC or the lasso, with the final model fitted on the variables most frequently (e.g., >= 50%) selected across all MI and bootstrap data sets; and (II) model selection on the original MI data, using the lasso. In class II, the final model is obtained by (i) averaging the estimates of variables selected in any MI data set, (ii) averaging the estimates of variables selected in at least 50% of the MI data sets, (iii) performing the lasso on the stacked MI data, or (iv) as in (iii) but with individual weights determined by the fraction of missingness. In all lasso models, we used both the optimal penalty and the 1-se rule. We also considered recalibrating the models to correct for overshrinkage due to a suboptimal penalty, by refitting either the linear predictor or all individual variables. We applied the methods to a real data set of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, lasso selection with the 1-se penalty showed the best performance, both in approach I and approach II. Stacking the MI data is an attractive approach because it does not require choosing a selection threshold when combining results from separate MI data sets.
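To make approach II-(iv) concrete, the sketch below stacks the M completed data sets and fits an L1-penalized logistic regression with individual weights. The weight formula w_i = (1 - f_i)/M (f_i being the subject's fraction of missing predictor values), the function name, and the fixed penalty C are illustrative assumptions, not the authors' code; in practice the penalty would be chosen by cross-validation, using either the optimal value or the 1-se rule.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stacked_weighted_lasso(imputed_X, y, frac_missing, C=0.1):
    """Lasso-penalized logistic regression on stacked MI data.

    imputed_X    : list of M arrays of shape (n, p), one completed
                   data set per imputation, rows in the same order
    y            : binary outcome of length n (assumed fully observed)
    frac_missing : f_i, fraction of missing predictor values per
                   subject, length n
    C            : inverse penalty strength (assumed fixed here; would
                   be chosen by cross-validation, e.g. the 1-se rule)
    """
    M = len(imputed_X)
    X = np.vstack(imputed_X)      # stack the M completed data sets
    y_stacked = np.tile(y, M)     # outcome repeated for each copy
    # Each subject appears M times; weight (1 - f_i) / M gives a total
    # weight of (1 - f_i) per subject, so heavily imputed subjects
    # contribute less to the stacked fit.
    w = np.tile((1.0 - frac_missing) / M, M)
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X, y_stacked, sample_weight=w)
    return model

# Toy usage with fake data: 3 imputations, 200 subjects, 10 predictors.
rng = np.random.default_rng(0)
n, p, M = 200, 10, 3
y = rng.integers(0, 2, size=n)
imputed_X = [rng.normal(size=(n, p)) for _ in range(M)]
frac_missing = rng.uniform(0, 0.3, size=n)
fit = stacked_weighted_lasso(imputed_X, y, frac_missing)
selected = np.flatnonzero(fit.coef_[0])  # variables surviving the lasso
```

The variables with nonzero coefficients form the selected model; refitting them in an unpenalized model, or rescaling the linear predictor, would correspond to the recalibration step discussed in the abstract.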
Pages: 343-356
Number of pages: 14