Comparison of subset selection methods in linear regression in the context of health-related quality of life and substance abuse in Russia

Times cited: 54
Authors
Morozova, Olga [1]
Levina, Olga [1]
Uusküla, Anneli [2]
Heimer, Robert [1]
Affiliations
[1] Yale Univ, Sch Publ Hlth, Dept Epidemiol Microbial Dis, New Haven, CT 06520 USA
[2] Univ Tartu, Dept Publ Hlth, EE-50090 Tartu, Estonia
Funding
National Institutes of Health (USA);
Keywords
Bayesian model selection; Penalized least squares; Stepwise regression; Linear regression; Subset selection; Quality of life; Substance abuse; HIV; Russia; VARIABLE SELECTION; MODEL SELECTION; HIV; INFERENCE; NOISE; LASSO; REGULARIZATION; INDIVIDUALS; PERFORMANCE; ALGORITHMS;
DOI
10.1186/s12874-015-0066-2
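Chinese Library Classification number
R19 [Health care organization and services (health services administration)];
Subject classification code
Abstract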
Background: Automatic stepwise subset selection methods in linear regression often perform poorly, both in variable selection and in estimation of coefficients and standard errors, especially when the number of independent variables is large and multicollinearity is present. Yet stepwise algorithms remain the dominant method in medical and epidemiological research.
Methods: We investigated the performance of stepwise methods (backward elimination and forward selection algorithms using AIC, BIC, and the likelihood ratio test (LRT) at p = 0.05) and of alternative subset selection methods in linear regression, including Bayesian model averaging (BMA) and penalized regression (lasso, adaptive lasso, and adaptive elastic net), in a dataset from a cross-sectional study of drug users in St. Petersburg, Russia, in 2012-2013. The dependent variable measured health-related quality of life, and the independent correlates comprised 44 variables measuring demographic, behavioral, and structural factors.
Results: In our case study, all methods returned models of different size and composition, ranging from 41 to 11 variables. The percentage of significant variables among those selected in the final model varied from 100% to 27%. Model selection with stepwise methods was highly unstable, with most of the selected variables being significant, that is, with a 95% confidence interval for the coefficient that did not include zero (all of them in the case of backward elimination with BIC, forward selection with BIC, and backward elimination with LRT). Adaptive elastic net demonstrated improved stability and more conservative estimates of coefficients and standard errors compared to stepwise methods. By incorporating model uncertainty into subset selection and into estimation of coefficients and their standard deviations, BMA returned a parsimonious model with the most conservative results in terms of covariate significance.
Conclusions: BMA and adaptive elastic net performed best in our analysis. Based on our results and previous theoretical studies, stepwise methods in medical and epidemiological research may be outperformed by alternative methods in cases such as ours. In situations of high uncertainty it is beneficial to apply different methodologically sound subset selection methods and to explore where their outputs do and do not agree. We recommend that researchers, at a minimum, explore model uncertainty and stability as part of their analyses and report these details in epidemiological papers.
Pages: 17