Missing Data Imputation via Denoising Autoencoders: The Untold Story

Cited by: 31
Authors
Costa, Adriana Fonseca [1]
Santos, Miriam Seoane [1]
Soares, Jastin Pompeu [1]
Abreu, Pedro Henriques [1]
Affiliations
[1] Univ Coimbra, Dept Informat Engn, CISUC, Coimbra, Portugal
Source
ADVANCES IN INTELLIGENT DATA ANALYSIS XVII, IDA 2018 | 2018 / Vol. 11191
Keywords
Missing data; Missing mechanisms; Data imputation; Denoising autoencoders; Survival prediction; Incomplete data
DOI
10.1007/978-3-030-01768-2_8
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Missing data is the lack of information in a dataset; because it directly influences classification performance, neglecting it is not a valid option. Over the years, several studies have presented alternative imputation strategies to deal with the three missing data mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). However, no studies have examined the influence of all three mechanisms on the latest high-performance Artificial Intelligence techniques, such as Deep Learning. The goal of this work is to compare state-of-the-art imputation techniques with a Stacked Denoising Autoencoders approach. To that end, the missing data mechanisms were synthetically generated in 6 different ways, 8 different imputation techniques were implemented, and 33 complete datasets from different open source repositories were selected. The results showed that Support Vector Machines imputation ensures the best classification performance, while Multiple Imputation by Chained Equations performs better in terms of imputation quality.
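The abstract does not say how the six synthetic generation procedures work. As a rough illustration only, the sketch below shows one common way to simulate each of the three mechanisms (Missing Completely At Random, Missing At Random, Missing Not At Random) on a small numeric table. The function names (`mcar_mask`, `mar_mask`, `mnar_mask`, `apply_mask`), the 20% missing rate, and the "remove the smallest values" rule are assumptions made for this example, not the authors' procedures.

```python
import random

def mcar_mask(n_rows, rate, rng):
    """MCAR: every row has the same chance of losing its value."""
    n_missing = round(rate * n_rows)
    return set(rng.sample(range(n_rows), n_missing))

def mar_mask(rows, observed_col, rate):
    """MAR: missingness depends on another, fully observed column.
    Assumed rule: rows with the smallest `observed_col` values are hit."""
    n_missing = round(rate * len(rows))
    order = sorted(range(len(rows)), key=lambda i: rows[i][observed_col])
    return set(order[:n_missing])

def mnar_mask(rows, target_col, rate):
    """MNAR: missingness depends on the value being removed itself.
    Assumed rule: the smallest values of `target_col` are removed."""
    n_missing = round(rate * len(rows))
    order = sorted(range(len(rows)), key=lambda i: rows[i][target_col])
    return set(order[:n_missing])

def apply_mask(rows, target_col, mask):
    """Return a copy of `rows` with `target_col` set to None for masked rows."""
    return [
        [None if (i in mask and j == target_col) else v
         for j, v in enumerate(row)]
        for i, row in enumerate(rows)
    ]

rng = random.Random(0)
# Toy complete dataset: 20 rows, 2 numeric features.
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]

# Remove 20% of column 0 under each mechanism.
mcar = apply_mask(data, 0, mcar_mask(len(data), 0.2, rng))
mar = apply_mask(data, 0, mar_mask(data, observed_col=1, rate=0.2))
mnar = apply_mask(data, 0, mnar_mask(data, target_col=0, rate=0.2))
```

In this sketch the three masks remove the same number of values and differ only in which rows are chosen, which is the property that distinguishes the mechanisms.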
Pages: 87-98
Number of pages: 12