From Predictive Methods to Missing Data Imputation: An Optimization Approach

Cited by: 0
Authors
Bertsimas, Dimitris [1]
Pawlowski, Colin
Zhuo, Ying Daisy
Affiliations
[1] MIT, Sloan Sch Management, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Funding
U.S. National Science Foundation;
Keywords
missing data imputation; K-NN; SVM; optimal decision trees; gene expression data; multiple imputation; regression; values;
DOI
Not available
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Missing data is a common problem in real-world settings and for this reason has attracted significant attention in the statistical literature. We propose a flexible framework, based on formal optimization, to impute missing data with mixed continuous and categorical variables. This framework can readily incorporate various predictive models, including K-nearest neighbors, support vector machines, and decision-tree-based methods, and can be adapted for multiple imputation. We derive fast first-order methods that obtain high-quality solutions in seconds, following the general imputation algorithm opt.impute presented in this paper. We demonstrate that our proposed method improves out-of-sample accuracy in large-scale computational experiments across a sample of 84 data sets taken from the UCI Machine Learning Repository. Across all missing-at-random mechanisms and missing-data percentages considered, opt.impute produces the best overall imputation in most data sets when benchmarked against five other methods: mean impute, K-nearest neighbors, iterative kNN, Bayesian PCA, and predictive-mean matching, with an average reduction in mean absolute error of 8.3% against the best cross-validated benchmark method. Moreover, opt.impute leads to improved out-of-sample performance of learning algorithms trained using the imputed data, as demonstrated by computational experiments on 10 downstream tasks. For models trained using opt.impute single imputations with 50% of the data missing, the average out-of-sample R^2 is 0.339 in the regression tasks and the average out-of-sample accuracy is 86.1% in the classification tasks, compared to 0.315 and 84.4% for the best cross-validated benchmark method. In the multiple imputation setting, downstream models trained using opt.impute obtain a statistically significant improvement over models trained using multivariate imputation by chained equations (mice) in 8 out of 10 missing data scenarios considered.
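The abstract describes an alternating scheme in which a predictive model (for example, K-nearest neighbors) is fit on the observed entries and then used to re-impute the missing entries until the imputation stabilizes. The following is a minimal illustrative sketch of that general idea in Python; the function name knn_impute, its parameters, and the column-mean warm start are assumptions made here for illustration and are not the authors' opt.impute implementation, which formulates imputation as a formal optimization problem solved with fast first-order methods.

    import numpy as np

    def knn_impute(X, k=5, n_iters=10):
        """Illustrative K-NN-style imputation by alternating refinement.

        X       : 2-D float array with np.nan marking missing entries
        k       : number of neighbors used for each imputed entry
        n_iters : number of refinement passes

        Simplified sketch only; not the authors' opt.impute code.
        """
        X = np.asarray(X, dtype=float)
        missing = np.isnan(X)

        # Warm start: fill missing entries with column means.
        col_means = np.nanmean(X, axis=0)
        X_imp = np.where(missing, col_means, X)

        for _ in range(n_iters):
            # Pairwise Euclidean distances between rows of the current imputation.
            dists = np.linalg.norm(X_imp[:, None, :] - X_imp[None, :, :], axis=2)
            np.fill_diagonal(dists, np.inf)            # a row is not its own neighbor
            neighbors = np.argsort(dists, axis=1)[:, :k]

            # Re-impute every originally missing entry from its neighbors' values.
            X_new = X_imp.copy()
            for i, j in zip(*np.where(missing)):
                X_new[i, j] = X_imp[neighbors[i], j].mean()
            X_imp = X_new

        return X_imp

    # Example: impute a small matrix with two missing entries.
    X = np.array([[1.0, 2.0, np.nan],
                  [2.0, np.nan, 3.0],
                  [1.5, 2.5, 2.8],
                  [2.2, 2.1, 3.1]])
    print(knn_impute(X, k=2, n_iters=5))

In the paper's framework, the same alternating structure can be instantiated with other predictive models (e.g., SVM regression or decision trees) in place of the K-NN step, and repeated with different warm starts to produce multiple imputations.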
Pages: 39