Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction

Cited by: 139

Authors:

Hong, Shangzhi [1]

Lynn, Henry S. [1]

Affiliations:

[1] Fudan Univ, Sch Publ Hlth, Dept Biostat, Key Lab Publ Hlth Safety, Minist Educ, Shanghai, Peoples R China

Source:

BMC MEDICAL RESEARCH METHODOLOGY | 2020 / Volume 20 / Issue 01

Keywords:

Missing data imputation; Imputation accuracy; Random forest; MICE;

DOI:

10.1186/s12874-020-01080-1

Chinese Library Classification number:

R19 [Health care organizations and services (health services administration)];

Discipline classification code:

Abstract:

Background: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it is still inconclusive how they perform for non-normally distributed data or when there are non-linear relationships or interactions.

Methods: To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performances of the RF-based imputation methods missForest and CALIBERrfimpute were evaluated in comparison with predictive mean matching (PMM).

Results: Both missForest and CALIBERrfimpute have high predictive accuracy but missForest can produce severely biased regression coefficient estimates and downward biased confidence interval coverages, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than PMM for logistic regression relationships with interaction.

Conclusions: RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.
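The outcome-dependent MAR setting named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' simulation design: the lognormal covariate, the linear outcome model, the logistic missingness model, the coefficient values, and the function name `simulate_outcome_dependent_mar` are all hypothetical. The imputation step itself (missForest, CALIBERrfimpute, or PMM) is not shown.

```python
import math
import random


def simulate_outcome_dependent_mar(n=5000, beta0=1.0, beta1=0.5, seed=0):
    """Return (x, y) pairs where x is deleted with a probability driven by y.

    Hypothetical illustration only: the distributions and coefficients are
    assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)

    # Step 1: complete data with a highly right-skewed covariate.
    complete = []
    for _ in range(n):
        x = rng.lognormvariate(0.0, 1.0)
        y = beta0 + beta1 * x + rng.gauss(0.0, 1.0)
        complete.append((x, y))

    # Step 2: the missingness probability of x depends only on the fully
    # observed outcome y, so x is MAR given y (outcome-dependent MAR).
    y_mean = sum(y for _, y in complete) / n
    data = []
    for x, y in complete:
        p_missing = 1.0 / (1.0 + math.exp(-(y - y_mean)))
        x_obs = None if rng.random() < p_missing else x
        data.append((x_obs, y))
    return data


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


data = simulate_outcome_dependent_mar()
y_when_x_missing = [y for x, y in data if x is None]
y_when_x_observed = [y for x, y in data if x is not None]

print("fraction of x missing:", len(y_when_x_missing) / len(data))
print("mean y where x is missing: ", mean(y_when_x_missing))
print("mean y where x is observed:", mean(y_when_x_observed))
```

In this sketch, rows with a missing covariate have systematically different outcomes from rows with an observed covariate, which is the property an imputation method has to account for under this mechanism.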


Pages: 12
