Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction

Cited by: 139
Authors
Hong, Shangzhi [1]
Lynn, Henry S. [1]
Affiliations
[1] Fudan Univ, Sch Publ Hlth, Dept Biostat, Key Lab Publ Hlth Safety, Minist Educ, Shanghai, Peoples R China
Keywords
Missing data imputation; Imputation accuracy; Random forest; MICE
DOI
10.1186/s12874-020-01080-1
Chinese Library Classification number
R19 [Health care organization and services (health services administration)]
Abstract
Background: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data, especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it remains unclear how they perform for non-normally distributed data or when there are non-linear relationships or interactions.
Methods: To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performance of the RF-based imputation methods missForest and CALIBERrfimpute was evaluated in comparison with predictive mean matching (PMM).
Results: Both missForest and CALIBERrfimpute have high predictive accuracy, but missForest can produce severely biased regression coefficient estimates and downward-biased confidence interval coverage, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than those of PMM for logistic regression relationships with interaction.
Conclusions: RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.
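To make the phrase "outcome-dependent MAR covariates" concrete, the following is a minimal Python sketch of how such missingness could be generated for a single skewed covariate. It is an illustration under stated assumptions, not the authors' simulation design: the log-normal covariate, the coefficient values, the logistic missingness model, and the function names `simulate_outcome_dependent_mar` and `ols_slope` are all hypothetical. It does not implement missForest, CALIBERrfimpute, or PMM; it only contrasts a full-data fit with a complete-case fit.

```python
import numpy as np


def simulate_outcome_dependent_mar(n=5000, beta=(1.0, 0.5), miss_slope=1.0, seed=0):
    """Simulate a skewed covariate x, an outcome y, and MAR missingness in x.

    All distributions and parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Skewed (log-normal) covariate, standing in for a non-normal predictor.
    x = rng.lognormal(mean=0.0, sigma=0.5, size=n)
    # Linear outcome model with standard normal noise.
    y = beta[0] + beta[1] * x + rng.normal(0.0, 1.0, size=n)
    # Missingness in x depends only on the fully observed outcome y, so the
    # mechanism is MAR: larger y means a higher chance that x is missing.
    z = (y - y.mean()) / y.std()
    p_miss = 1.0 / (1.0 + np.exp(-miss_slope * z))
    missing = rng.random(n) < p_miss
    x_obs = np.where(missing, np.nan, x)
    return x, y, x_obs, missing


def ols_slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


if __name__ == "__main__":
    x, y, x_obs, missing = simulate_outcome_dependent_mar()
    print("proportion missing:", missing.mean())
    print("slope, full data:", ols_slope(x, y))
    # Complete-case analysis: drop rows where x is missing.
    keep = ~missing
    print("slope, complete cases:", ols_slope(x_obs[keep], y[keep]))
```

In a study like the one described, the incomplete covariate `x_obs` would then be passed to each imputation method, and the regression estimates from the imputed data would be compared with the known coefficient.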
Pages: 12