Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction

Cited by: 115
Authors
Hong, Shangzhi [1]
Lynn, Henry S. [1]
Affiliations
[1] Fudan Univ, Sch Publ Hlth, Dept Biostat, Key Lab Publ Hlth Safety,Minist Educ, Shanghai, Peoples R China
Keywords
Missing data imputation; Imputation accuracy; Random forest; MICE
DOI
10.1186/s12874-020-01080-1
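Chinese Library Classification number
R19 [Health care organization and services (health services administration)]
Subject classification code
Abstract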
Background: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it is still inconclusive how they perform for non-normally distributed data or when there are non-linear relationships or interactions.

Methods: To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performances of the RF-based imputation methods missForest and CALIBERrfimpute were evaluated in comparison with predictive mean matching (PMM).

Results: Both missForest and CALIBERrfimpute have high predictive accuracy but missForest can produce severely biased regression coefficient estimates and downward biased confidence interval coverages, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than PMM for logistic regression relationships with interaction.

Conclusions: RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.
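As an illustration of what "outcome-dependent MAR covariates" means in the Methods, the following minimal Python sketch simulates a skewed covariate, a non-linearly related outcome, and missingness in the covariate that depends only on the fully observed outcome. It is not the authors' simulation design: the gamma distribution, the regression coefficients, the logistic missingness model, and the function name `make_outcome_dependent_mar` are all assumptions chosen for illustration.

```python
import numpy as np


def make_outcome_dependent_mar(n=1000, seed=0):
    """Simulate (x, y) and delete values of x with a probability driven by y.

    All distributions and coefficients here are illustrative assumptions,
    not the settings used in the paper.
    """
    rng = np.random.default_rng(seed)

    # Skewed (non-normal) covariate.
    x = rng.gamma(shape=2.0, scale=1.0, size=n)

    # Outcome with a non-linear (quadratic) dependence on x.
    y = 1.0 + 0.5 * x**2 + rng.normal(size=n)

    # Missingness in x depends only on the fully observed outcome y,
    # so the mechanism is MAR (and outcome-dependent), not MCAR.
    z = (y - y.mean()) / y.std()
    p_missing = 1.0 / (1.0 + np.exp(-(-1.0 + z)))
    missing = rng.random(n) < p_missing

    # Observed covariate: NaN wherever the value was deleted.
    x_obs = np.where(missing, np.nan, x)
    return x, x_obs, y, missing


if __name__ == "__main__":
    x, x_obs, y, missing = make_outcome_dependent_mar()
    print("fraction missing:", missing.mean())
    print("mean y (x missing): ", y[missing].mean())
    print("mean y (x observed):", y[~missing].mean())
```

Because cases with larger outcomes are more likely to lose their covariate value, the observed covariate values are no longer representative of the full sample. This is the setting in which the abstract reports biased regression coefficients after imputation.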
Pages: 12