Comparative methods for handling missing data in large databases

Cited by: 79
Authors
Henry, Antonia J. [1, 2]
Hevelone, Nathanael D. [2 ]
Lipsitz, Stuart [2 ]
Nguyen, Louis L. [1, 2]
Affiliations
[1] Harvard Univ, Sch Med, Brigham & Womens Hosp, Div Vasc & Endovasc Surg, Cambridge, MA 02138 USA
[2] Harvard Univ, Sch Med, Brigham & Womens Hosp, Ctr Surg & Publ Hlth, Cambridge, MA 02138 USA
Funding
National Institutes of Health (NIH), USA;
Keywords
ESTIMATING-EQUATIONS; MULTIPLE IMPUTATION; COVARIATE DATA; REGRESSION;
DOI
10.1016/j.jvs.2013.05.008
Chinese Library Classification number
R61 [Operative Surgery];
Abstract
Objective: Analysis of complex survey databases is an important tool for health services researchers. Missing data elements are challenging because the reasons for "missingness" are multifactorial, especially for categorical variables such as race. We simulated missing data for race and analyzed the bias from five methods used in predicting major amputation in patients with critical limb ischemia (CLI). Methods: Patient discharges with fully observed data containing lower extremity revascularization or major amputation and CLI were selected from the 2003 to 2007 Nationwide Inpatient Sample, a complex survey database (weighted n = 684,057). Considering several random missing data schemes, we compared five missing data methods: complete case analysis, replacement with observed frequencies, missing indicator variable, multiple imputation, and reweighted estimating equations. We created 100 simulated data sets, with 5%, 15%, or 30% of subjects' race drawn to be missing from the full data set. Bias was estimated by comparing the estimated regression coefficients averaged over 100 simulated data sets (beta(miss)) from each method vs estimates from the fully observed data set (beta(full)), with relative bias calculated as ((beta(full) - beta(miss)) / beta(full)) x 100%. Results: Our results demonstrate that reweighted estimating equations produce the least biased coefficients and the missing indicator variable produces the most biased coefficients. Complete case analysis, replacement with observed frequencies, and multiple imputation resulted in moderate bias. Sensitivity analysis demonstrated that the optimal method choice depends on the quantity and type of missing data encountered. Conclusions: Missing data are an important analytic topic in research with large databases. The commonly used missing indicator variable method introduces severe bias and should be used with caution. We present empiric evidence to guide method selection for handling missing data.
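The relative bias measure described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes synthetic data, an ordinary least-squares fit standing in for the survey-weighted regression of the study, and values deleted completely at random. Only complete case analysis is shown. The helper names (relative_bias, fit_ols, simulate_complete_case) are hypothetical.

```python
import numpy as np

def relative_bias(beta_full, beta_miss_runs):
    """Relative bias (%) = (beta_full - mean(beta_miss)) / beta_full * 100."""
    beta_miss = np.mean(beta_miss_runs, axis=0)  # average over simulated data sets
    return (beta_full - beta_miss) / beta_full * 100.0

def fit_ols(X, y):
    """Least-squares coefficients for y ~ X (X already holds an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def simulate_complete_case(X, y, miss_fraction, n_sims=100, seed=0):
    """Hypothetical helper: mark rows missing completely at random,
    drop them (complete case analysis), refit, and collect coefficients."""
    rng = np.random.default_rng(seed)
    n = len(y)
    runs = []
    for _ in range(n_sims):
        observed = rng.random(n) >= miss_fraction  # True where the covariate is observed
        runs.append(fit_ols(X[observed], y[observed]))
    return np.array(runs)

# Synthetic illustration only (assumption: not Nationwide Inpatient Sample data).
rng = np.random.default_rng(42)
n = 5000
group = rng.integers(0, 2, size=n)            # stand-in binary covariate
X = np.column_stack([np.ones(n), group])      # intercept + covariate
y = 1.0 + 0.5 * group + rng.normal(0.0, 1.0, size=n)

beta_full = fit_ols(X, y)                     # estimates from fully observed data
results = {}
for frac in (0.05, 0.15, 0.30):               # missingness levels named in the abstract
    runs = simulate_complete_case(X, y, frac)
    results[frac] = relative_bias(beta_full, runs)
    print(f"{int(frac * 100)}% missing: relative bias (%) = {results[frac]}")
```

Because this sketch deletes values completely at random, complete case analysis should show relative bias close to zero here. The study considered several missing data schemes and a categorical race variable, so its reported bias patterns are not expected to match this toy setup.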
Pages: 1353 / +
Page count: 13