From Predictive Methods to Missing Data Imputation: An Optimization Approach

Cited by: 0
Authors
Bertsimas, Dimitris [1]
Pawlowski, Colin
Zhuo, Ying Daisy
Affiliations
[1] MIT, Sloan Sch Management, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Funding
U.S. National Science Foundation;
Keywords
missing data imputation; K-NN; SVM; optimal decision trees; gene expression data; multiple imputation; regression; values;
DOI
Not available
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Missing data is a common problem in real-world settings and for this reason has attracted significant attention in the statistical literature. We propose a flexible framework, based on formal optimization, to impute missing data with mixed continuous and categorical variables. This framework can readily incorporate various predictive models, including K-nearest neighbors, support vector machines, and decision-tree-based methods, and can be adapted for multiple imputation. We derive fast first-order methods that obtain high-quality solutions in seconds, following the general imputation algorithm opt.impute presented in this paper. We demonstrate that our proposed method improves out-of-sample accuracy in large-scale computational experiments across a sample of 84 data sets taken from the UCI Machine Learning Repository. Across all missing-at-random mechanisms and missing-data percentages considered, opt.impute produces the best overall imputation in most data sets when benchmarked against five other methods: mean impute, K-nearest neighbors, iterative kNN, Bayesian PCA, and predictive-mean matching, with an average reduction in mean absolute error of 8.3% against the best cross-validated benchmark method. Moreover, opt.impute leads to improved out-of-sample performance of learning algorithms trained using the imputed data, as demonstrated by computational experiments on 10 downstream tasks. For models trained using opt.impute single imputations with 50% of the data missing, the average out-of-sample R^2 is 0.339 in the regression tasks and the average out-of-sample accuracy is 86.1% in the classification tasks, compared to 0.315 and 84.4% for the best cross-validated benchmark method. In the multiple imputation setting, downstream models trained using opt.impute obtain a statistically significant improvement over models trained using multivariate imputation by chained equations (mice) in 8 out of 10 missing data scenarios considered.
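The abstract describes an alternating scheme in which a predictive model (for example, K-nearest neighbors) is fit on the observed entries and then used to re-impute the missing entries until the imputation stabilizes. The following is a minimal illustrative sketch of that general idea in Python; the function name knn_impute, its parameters, and the column-mean warm start are assumptions made here for illustration and are not the authors' opt.impute implementation, which formulates imputation as a formal optimization problem solved with fast first-order methods.

    import numpy as np

    def knn_impute(X, k=5, n_iters=10):
        """Illustrative K-NN-style imputation by alternating refinement.

        X       : 2-D float array with np.nan marking missing entries
        k       : number of neighbors used for each imputed entry
        n_iters : number of refinement passes

        Simplified sketch only; not the authors' opt.impute code.
        """
        X = np.asarray(X, dtype=float)
        missing = np.isnan(X)

        # Warm start: fill missing entries with column means.
        col_means = np.nanmean(X, axis=0)
        X_imp = np.where(missing, col_means, X)

        for _ in range(n_iters):
            # Pairwise Euclidean distances between rows of the current imputation.
            dists = np.linalg.norm(X_imp[:, None, :] - X_imp[None, :, :], axis=2)
            np.fill_diagonal(dists, np.inf)            # a row is not its own neighbor
            neighbors = np.argsort(dists, axis=1)[:, :k]

            # Re-impute every originally missing entry from its neighbors' values.
            X_new = X_imp.copy()
            for i, j in zip(*np.where(missing)):
                X_new[i, j] = X_imp[neighbors[i], j].mean()
            X_imp = X_new

        return X_imp

    # Example: impute a small matrix with two missing entries.
    X = np.array([[1.0, 2.0, np.nan],
                  [2.0, np.nan, 3.0],
                  [1.5, 2.5, 2.8],
                  [2.2, 2.1, 3.1]])
    print(knn_impute(X, k=2, n_iters=5))

In the paper's framework, the same alternating structure can be instantiated with other predictive models (e.g., SVM regression or decision trees) in place of the K-NN step, and repeated with different warm starts to produce multiple imputations.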
Pages: 39