Performance of using multiple stepwise algorithms for variable selection

Cited by: 65
Authors
Wiegand, Ryan E. [1]
Affiliation
[1] Ctr Dis Control & Prevent, Div HIV AIDS Prevent, Natl Ctr HIV Viral Hepatitis STD & TB Prevent, Atlanta, GA 30333 USA
Keywords
stepwise; variable selection; regression; LOGISTIC-REGRESSION ANALYSIS; QUANTITATIVE TRAIT LOCI; MODEL SELECTION; RISK-FACTORS; SIMULATION; ASSOCIATION; DISEASE; IDENTIFICATION; PREDICTION; SURVIVAL;
DOI
10.1002/sim.3943
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Discipline classification code
07; 0710; 09
Abstract
Some research studies in the medical literature use multiple stepwise variable selection (SVS) algorithms to build multivariable models. The purpose of this study is to determine whether the use of multiple SVS algorithms in tandem (stepwise agreement) is a valid variable selection procedure. Computer simulations were developed to address stepwise agreement. Three popular SVS algorithms were tested (backward elimination, forward selection, and stepwise) on three statistical methods (linear, logistic, and Cox proportional hazards regression). Other simulation parameters explored were the sample size, number of predictors considered, degree of correlation between pairs of predictors, p-value-based entrance and exit criteria, predictor type (normally distributed or binary), and differences between stepwise agreement between any two or all three algorithms. Among stepwise methods, the rate of agreement, agreement on a model including only those predictors truly associated with the outcome, and agreement on a model containing the predictors truly associated with the outcome were measured. These rates were dependent on all simulation parameters. Mostly, the SVS algorithms agreed on a final model, but rarely on a model with only the true predictors. Sample size and candidate predictor pool size are the most influential simulation conditions. To conclude, stepwise agreement is often a poor strategy that gives misleading results and researchers should avoid using multiple SVS algorithms to build multivariable models. More research on the relationship between sample size and variable selection is needed. Published in 2010 by John Wiley & Sons, Ltd.
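The stepwise agreement idea described in the abstract can be illustrated with a small simulation. The sketch below is not the author's simulation code; it is a minimal illustration under stated assumptions: linear regression only, independent standard-normal predictors (no pairwise correlation), a normal approximation to the t-test for p-values, and hypothetical names and defaults (`simulate`, `p_enter`, `p_exit`, effect size 0.5). It counts how often forward selection, backward elimination, and stepwise selection agree on a final model, and how often that agreed model contains only the true predictors.

```python
import math
import random

def solve_inverse(A):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def pvalues(X, y, cols):
    """Two-sided p-values for predictors `cols` in an OLS fit with intercept."""
    rows = [[1.0] + [x[j] for j in cols] for x in X]
    k = len(cols) + 1
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    inv = solve_inverse(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    sigma2 = rss / (len(y) - k)
    out = {}
    for i, j in enumerate(cols, start=1):
        z = beta[i] / math.sqrt(sigma2 * inv[i][i])
        out[j] = math.erfc(abs(z) / math.sqrt(2))  # assumption: normal approx. to the t-test
    return out

def forward(X, y, p_enter=0.05):
    """Forward selection: add the most significant candidate until none passes p_enter."""
    chosen, rest = [], list(range(len(X[0])))
    while rest:
        best = min(rest, key=lambda j: pvalues(X, y, chosen + [j])[j])
        if pvalues(X, y, chosen + [best])[best] >= p_enter:
            break
        chosen.append(best)
        rest.remove(best)
    return set(chosen)

def backward(X, y, p_exit=0.05):
    """Backward elimination: drop the least significant predictor until all pass p_exit."""
    chosen = list(range(len(X[0])))
    while chosen:
        p = pvalues(X, y, chosen)
        worst = max(chosen, key=p.get)
        if p[worst] < p_exit:
            break
        chosen.remove(worst)
    return set(chosen)

def stepwise(X, y, p_enter=0.05, p_exit=0.05):
    """Stepwise: forward steps, each followed by backward checks on the current model."""
    chosen = []
    for _ in range(2 * len(X[0])):  # iteration cap guards against cycling
        rest = [j for j in range(len(X[0])) if j not in chosen]
        if not rest:
            break
        best = min(rest, key=lambda j: pvalues(X, y, chosen + [j])[j])
        if pvalues(X, y, chosen + [best])[best] >= p_enter:
            break
        chosen.append(best)
        while chosen:
            p = pvalues(X, y, chosen)
            worst = max(chosen, key=p.get)
            if p[worst] < p_exit:
                break
            chosen.remove(worst)
    return set(chosen)

def simulate(n=100, n_pred=8, n_true=2, reps=50, seed=1):
    """Return (rate of agreement, rate of agreement on exactly the true model)."""
    rng = random.Random(seed)
    truth = set(range(n_true))  # hypothetical setup: the first n_true predictors are real
    agree = agree_true = 0
    for _ in range(reps):
        X = [[rng.gauss(0, 1) for _ in range(n_pred)] for _ in range(n)]
        y = [sum(0.5 * x[j] for j in truth) + rng.gauss(0, 1) for x in X]
        f, b, s = forward(X, y), backward(X, y), stepwise(X, y)
        if f == b == s:
            agree += 1
            agree_true += f == truth
    return agree / reps, agree_true / reps
```

Calling `simulate()` returns the two rates as a tuple. Varying `n` and `n_pred` gives a rough feel for the sample-size and candidate-pool effects the abstract reports, though the numbers from this toy setup should not be read as a reproduction of the paper's results.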
Pages: 1647-1659
Number of pages: 13