A novel missing data imputation approach based on clinical conditional Generative Adversarial Networks applied to EHR datasets

Cited by: 8
Authors
Bernardini, Michele [1]
Doinychko, Anastasiia [4]
Romeo, Luca [2]
Frontoni, Emanuele [3]
Amini, Massih-Reza [4]
Affiliations
[1] Univ Politecn Marche, Dept Informat Engn DII, Ancona, Italy
[2] Univ Macerata, Dept Econ & Law, Macerata, Italy
[3] Univ Macerata, Dept Polit Sci Commun & Int Relat, Macerata, Italy
[4] Univ Grenoble Alpes, Grenoble Informat Lab, St Martin Dheres, France
Keywords
Data imputation; Generative Adversarial Network; Electronic Health Record; Machine Learning; Predictive medicine; Time series; Model
DOI
10.1016/j.compbiomed.2023.107188
Chinese Library Classification number
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
The missing data mechanism is a relevant problem in the Machine Learning (ML) and biomedical informatics communities. Real-world Electronic Health Record (EHR) datasets contain many missing values, which results in a high level of spatiotemporal sparsity in the predictor matrix. Several state-of-the-art approaches have tried to deal with this problem by proposing data imputation strategies that (i) are often unrelated to the ML model, (ii) are not conceived for EHR data, where laboratory exams are not prescribed uniformly over time and the percentage of missing values is high, and (iii) exploit only univariate and linear information on the observed features. Our paper proposes a data imputation strategy based on a clinical conditional Generative Adversarial Network (ccGAN) capable of imputing missing values by exploiting non-linear and multivariate information across patients. Unlike other GAN-based data imputation approaches, our method deals explicitly with the high level of missingness of routine EHR data by conditioning the imputation strategy on the observable values and on those that are fully annotated. We demonstrated that the ccGAN yields statistically significant gains over other state-of-the-art approaches in terms of imputation (a gain of around 19.79% over the best competitor) and predictive performance (a gain of up to 1.60% over the best competitor) on a real multi-center diabetic dataset. We also demonstrated its robustness across different missingness rates (a gain of up to 1.61% over the best competitor under the highest missingness rates) on an additional benchmark EHR dataset.
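The abstract describes imputing only the missing entries while conditioning on the observed ones. The sketch below illustrates that general idea in Python with NumPy and is not the authors' ccGAN: the function names are hypothetical, the trained generator is replaced by a per-feature mean as a stand-in, and the error metric (RMSE on held-out entries) is an assumption, since the record does not state which metric the paper uses.

```python
import numpy as np

def merge_imputation(x, mask, x_generated):
    """Keep observed entries of x (mask == 1); fill the rest from x_generated."""
    return mask * np.nan_to_num(x) + (1 - mask) * x_generated

def masked_rmse(x_true, x_imputed, eval_mask):
    """RMSE restricted to the entries flagged in eval_mask (1 = evaluate)."""
    diff = (x_true - x_imputed) * eval_mask
    return float(np.sqrt((diff ** 2).sum() / eval_mask.sum()))

def column_mean_generator(x, mask):
    """Stand-in for a trained generator (assumption): mean of observed entries per feature."""
    observed = np.where(mask == 1, x, 0.0)
    counts = np.maximum(mask.sum(axis=0), 1)  # avoid division by zero for empty columns
    means = observed.sum(axis=0) / counts
    return np.broadcast_to(means, x.shape)

# Toy patient-by-feature matrix; mask marks observed (1) and missing (0) entries.
x_true = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mask = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
x_observed = np.where(mask == 1, x_true, np.nan)

x_generated = column_mean_generator(x_observed, mask)
x_imputed = merge_imputation(x_observed, mask, x_generated)

# Score only the entries that were hidden from the imputer.
rmse = masked_rmse(x_true, x_imputed, 1 - mask)
print(x_imputed)
print(rmse)
```

Observed values pass through the merge step unchanged, so the error is computed only where the imputer had to guess. A GAN-based imputer would replace `column_mean_generator` with a learned model that takes the observed values and the mask as input.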
Pages: 10