A Benchmark for Missing Data Imputation Techniques: Development Perspectives and Performance Comparative

Cited by: 0
Authors
Cabrera-Sanchez, Juan-Francisco [1]
Cruz-Corona, Carlos [2]
Escolano, Andres Yanez [1]
Silva-Ramirez, Esther-Lydia [1]
Affiliations
[1] Univ Cadiz, Dept Comp Sci & Engn, Puerto Real, Spain
[2] Univ Granada, Dept Comp Sci & Artificial Intelligence, Granada, Spain
Source
OPTIMIZATION AND LEARNING, OLA 2024 | 2025 / Vol. 2311
Keywords
Missing data; Machine Learning; Deep Learning; Autoencoders
DOI
10.1007/978-3-031-77941-1_11
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Knowledge extraction from information stored in databases is always subject to the presence of missing values. Missing data is an unavoidable problem that affects many disciplines of researchers and data scientists. Inasmuch as machine learning algorithms cannot work with incomplete data in the data sets, data imputation is an essential task to obtain quality data. This research approach provides an overview of the data missingness mechanism and the process of generating synthetic missing data, the imputation of all types of variables, and the performance assessment of several imputation methods. Traditional algorithms, Machine Learning methods and various Autoencoder-based deep learning architectures have been studied. An exhaustive analysis and comparison of 21 heterogeneous data sets in various areas has been proposed. They have been exposed to a perturbation procedure with different missingness mechanisms and various missingness rates, covering the different possibilities that can occur in real life. The experimental results show that deep learning models outperform the other methods studied. Furthermore, the performance of data imputation methods does not depend on the missingness mechanism or the synthetic missingness generation method used nor on the percentage of missing values.
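The perturbation procedure mentioned in the abstract can be illustrated with a minimal sketch. It assumes the simplest missingness mechanism (missing completely at random), a plain column-mean imputer as the baseline, and RMSE on the removed entries as the score. The function names `perturb_mcar`, `impute_mean` and `rmse_on_missing`, the synthetic Gaussian data, and the three missingness rates are all hypothetical choices made for illustration; they are not taken from the paper, which studies other mechanisms and far stronger imputers.

```python
import math
import random


def perturb_mcar(rows, rate, rng):
    """Copy `rows`, replacing each entry with None with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_mean(rows):
    """Fill each None with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def rmse_on_missing(truth, perturbed, imputed):
    """Root mean squared error, computed only over the entries that were removed."""
    squared = [
        (t - i) ** 2
        for t_row, p_row, i_row in zip(truth, perturbed, imputed)
        for t, p, i in zip(t_row, p_row, i_row)
        if p is None
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical complete data set: 200 rows, 5 numeric columns.
rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]

# Perturb at several missingness rates and score the baseline imputer at each.
results = {}
for rate in (0.1, 0.3, 0.5):
    perturbed = perturb_mcar(data, rate, rng)
    imputed = impute_mean(perturbed)
    results[rate] = rmse_on_missing(data, perturbed, imputed)
    print(f"rate={rate:.1f}  RMSE={results[rate]:.3f}")
```

Scoring only the removed entries keeps the metric from being diluted by values the imputer never had to guess. Swapping `impute_mean` for another imputer, or `perturb_mcar` for a mechanism that depends on observed or unobserved values, would reuse the same loop.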
Pages: 140-153
Page count: 14
Related papers
20 entries in total
  • [1] Establishing strong imputation performance of a denoising autoencoder in a wide range of missing data problems
    Abiri, Najmeh
    Linse, Bjorn
    Eden, Patrik
    Ohlsson, Mattias
    [J]. NEUROCOMPUTING, 2019, 365 : 137 - 146
  • [2] Beaulieu-Jones BK, 2017, BIOCOMPUT-PAC SYM, P207, DOI 10.1142/9789813207813_0021
  • [3] Bishop C., 1995, Neural Networks For Pattern Recognition
  • [4] Dua D., 2017, UCI Machine Learning Repository
  • [5] Pattern classification with missing data: a review
    Garcia-Laencina, Pedro J.
    Sancho-Gomez, Jose-Luis
    Figueiras-Vidal, Anibal R.
    [J]. NEURAL COMPUTING & APPLICATIONS, 2010, 19 (02) : 263 - 282
  • [6] An extensive analysis of the interaction between missing data types, imputation methods, and supervised classifiers
    Garciarena, Unai
    Santana, Roberto
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2017, 89 : 52 - 65
  • [7] Gondara Lovedeep, 2018, Advances in Knowledge Discovery and Data Mining. 22nd Pacific-Asia Conference, PAKDD 2018. Proceedings: LNAI 10939, P260, DOI 10.1007/978-3-319-93040-4_21
  • [8] A Benchmark for Data Imputation Methods
    Jaeger, Sebastian
    Allhorn, Arndt
    Biessmann, Felix
    [J]. FRONTIERS IN BIG DATA, 2021, 4
  • [9] Missing data imputation using statistical and machine learning methods in a real breast cancer problem
    Jerez, Jose M.
    Molina, Ignacio
    Garcia-Laencina, Pedro J.
    Alba, Emilio
    Ribelles, Nuria
    Martin, Miguel
    Franco, Leonardo
    [J]. ARTIFICIAL INTELLIGENCE IN MEDICINE, 2010, 50 (02) : 105 - 115
  • [10] An Experimental Survey of Missing Data Imputation Algorithms
    Miao, Xiaoye
    Wu, Yangyang
    Chen, Lu
    Gao, Yunjun
    Yin, Jianwei
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (07) : 6630 - 6650