A Benchmark for Missing Data Imputation Techniques: Development Perspectives and Performance Comparative

Cited by: 0

Authors:

Cabrera-Sanchez, Juan-Francisco [1]

Cruz-Corona, Carlos [2]

Escolano, Andres Yanez [1]

Silva-Ramirez, Esther-Lydia [1]

Affiliations:

[1] Univ Cadiz, Dept Comp Sci & Engn, Puerto Real, Spain

[2] Univ Granada, Dept Comp Sci & Artificial Intelligence, Granada, Spain

Source:

OPTIMIZATION AND LEARNING, OLA 2024 | 2025 / Vol. 2311

Keywords:

Missing data; Machine Learning; Deep Learning; Autoencoders

DOI:

10.1007/978-3-031-77941-1_11

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Knowledge extraction from information stored in databases is always subject to the presence of missing values. Missing data is an unavoidable problem that affects researchers and data scientists across many disciplines. Because machine learning algorithms cannot work with incomplete data sets, data imputation is an essential task for obtaining quality data. This work provides an overview of missingness mechanisms and the process of generating synthetic missing data, the imputation of all types of variables, and the performance assessment of several imputation methods. Traditional algorithms, Machine Learning methods and various Autoencoder-based deep learning architectures are studied. An exhaustive analysis and comparison is carried out on 21 heterogeneous data sets from various areas. The data sets are subjected to a perturbation procedure with different missingness mechanisms and various missingness rates, covering the different situations that can occur in real life. The experimental results show that deep learning models outperform the other methods studied. Furthermore, the performance of the data imputation methods does not depend on the missingness mechanism, on the synthetic missingness generation method used, or on the percentage of missing values.
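
The perturbation-and-evaluation procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' benchmark: it assumes a purely numeric toy data set, a missing-completely-at-random (MCAR) mechanism, and column-mean imputation as a stand-in for the traditional methods. The helper names (mask_mcar, impute_mean, rmse_on_missing) and the chosen rates are hypothetical.

```python
import math
import random


def mask_mcar(rows, rate, seed=0):
    """Return a copy of rows where each cell is set to None with probability `rate`.

    Every cell is masked independently of all values, which is the MCAR assumption.
    """
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_mean(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Arbitrary fallback (0.0) if a column lost every value.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def rmse_on_missing(truth, masked, imputed):
    """Root mean squared error over the masked cells only."""
    errors = [
        (t - i) ** 2
        for t_row, m_row, i_row in zip(truth, masked, imputed)
        for t, m, i in zip(t_row, m_row, i_row)
        if m is None
    ]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0


# Toy complete data set: 200 rows, 3 numeric columns (illustrative only).
_rng = random.Random(42)
data = [[_rng.gauss(0, 1), _rng.gauss(5, 2), _rng.uniform(0, 10)] for _ in range(200)]

if __name__ == "__main__":
    # Perturb the complete data at several missingness rates and score the imputer.
    for rate in (0.1, 0.3, 0.5):
        masked = mask_mcar(data, rate, seed=1)
        imputed = impute_mean(masked)
        print(rate, round(rmse_on_missing(data, masked, imputed), 3))
```

Scoring only the masked cells against the known ground truth is what makes synthetic missingness useful for comparison. Other mechanisms (MAR, MNAR) would replace the uniform masking rule with one that depends on observed or unobserved values.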


Pages: 140-153

Page count: 14
