Extremely missing numerical data in Electronic Health Records for machine learning can be managed through simple imputation methods considering informative missingness: A comparative of solutions in a COVID-19 mortality case study

Cited by: 6
Authors
Ferri, Pablo [1]
Romero-Garcia, Nekane [2]
Badenes, Rafael [2, 3, 4]
Lora-Pablos, David [5, 6]
Morales, Teresa Garcia [5]
de la Camara, Agustin Gomez [5]
Garcia-Gomez, Juan M. [1]
Saez, Carlos [1]
Affiliations
[1] Univ Politecn Valencia, Inst Univ Tecnol Informac & Comunicac, Biomed Data Sci Lab, Camino Vera S-N, Valencia 46022, Spain
[2] Univ Valencia, Dept Cirugia, Valencia, Spain
[3] Hosp Clin Univ Valencia, Inst INCLIVA, Valencia, Spain
[4] Hosp Clin Univ, Dept Anesthesiol, Surg Trauma Intens Care & Pain Clin, Valencia, Spain
[5] Hosp 12 Octubre, Inst Invest Imas12, Madrid, Spain
[6] Univ Complutense Madrid, Fac Estudios Estadist, Madrid, Spain
Keywords
Machine learning; Missing data; Data imputation; Informative missingness; Electronic health records; COVID-19;
DOI
10.1016/j.cmpb.2023.107803
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Background and objective: Reusing Electronic Health Records (EHRs) for Machine Learning (ML) leads on many occasions to extremely incomplete and sparse tabular datasets, which can hinder the model development processes and limit their performance and generalization. In this study, we aimed to characterize the most effective data imputation techniques and ML models for dealing with highly missing numerical data in EHRs, in the case where only a very limited number of data are complete, as opposed to the usual case of having a reduced number of missing values.
Methods: We used a case study including full blood count laboratory data, demographic and survival data in the context of COVID-19 hospital admissions and evaluated 30 processing pipelines combining imputation methods with ML classifiers. The imputation methods included missing mask, translation and encoding, mean imputation, k-nearest neighbors' imputation, Bayesian ridge regression imputation and generative adversarial imputation networks. The classifiers included k-nearest neighbors, logistic regression, random forest, gradient boosting and deep multilayer perceptron.
Results: Our results suggest that in the presence of highly missing data, combining translation and encoding imputation (which considers informative missingness) with tree ensemble classifiers (random forest and gradient boosting) is a sensible choice when aiming to maximize performance, in terms of area under the curve.
Conclusions: Based on our findings, we recommend the consideration of this imputer-classifier configuration when constructing models in the presence of extremely incomplete numerical data in EHRs.
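The Methods above describe a grid of six imputation methods crossed with five classifiers, which gives the 30 pipelines. The sketch below, a minimal illustration and not the authors' code, enumerates that grid and shows one simple way to keep missingness information available to a classifier: mean imputation with a binary missing mask appended to each row. The abstract does not say how the "missing mask" method fills values, so pairing the mask with mean fill is an assumption here. The function names and the toy values are hypothetical.

```python
import math
from itertools import product

# Method names taken from the abstract; the grid has 6 x 5 = 30 combinations.
IMPUTERS = [
    "missing mask",
    "translation and encoding",
    "mean imputation",
    "k-nearest neighbors imputation",
    "Bayesian ridge regression imputation",
    "generative adversarial imputation networks",
]
CLASSIFIERS = [
    "k-nearest neighbors",
    "logistic regression",
    "random forest",
    "gradient boosting",
    "deep multilayer perceptron",
]
PIPELINES = list(product(IMPUTERS, CLASSIFIERS))


def is_missing(value):
    """Treat None and NaN as missing (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def column_means(rows):
    """Per-column mean over observed values; 0.0 if a column is fully missing."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if not is_missing(row[j])]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def impute_mean_with_mask(rows, means):
    """Fill missing entries with the column mean and append a 0/1 mask.

    Assumption: the mask is concatenated to the features so a classifier
    can use the fact that a value was missing (informative missingness).
    """
    imputed = []
    for row in rows:
        values = [means[j] if is_missing(v) else float(v) for j, v in enumerate(row)]
        mask = [1.0 if is_missing(v) else 0.0 for v in row]
        imputed.append(values + mask)
    return imputed


# Made-up toy rows with two numeric features; None marks a missing value.
X_train = [[13.5, None], [None, 250.0], [11.5, 150.0]]
means = column_means(X_train)  # fit on training data only
X_imputed = impute_mean_with_mask(X_train, means)
```

The means are computed once on the training rows and reused for any held-out rows, so that imputation does not leak information from evaluation data.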
Pages: 9