Extremely missing numerical data in Electronic Health Records for machine learning can be managed through simple imputation methods considering informative missingness: A comparative of solutions in a COVID-19 mortality case study

Cited by: 6

Authors:

Ferri, Pablo [1]

Romero-Garcia, Nekane [2]

Badenes, Rafael [2,3,4]

Lora-Pablos, David [5,6]

Morales, Teresa Garcia [5]

de la Camara, Agustin Gomez [5]

Garcia-Gomez, Juan M. [1]

Saez, Carlos [1]

Affiliations:

[1] Univ Politecn Valencia, Inst Univ Tecnol Informac & Comunicac, Biomed Data Sci Lab, Camino Vera S-N, Valencia 46022, Spain

[2] Univ Valencia, Dept Cirugia, Valencia, Spain

[3] Hosp Clin Univ Valencia, Inst INCLIVA, Valencia, Spain

[4] Hosp Clin Univ, Dept Anesthesiol, Surg Trauma Intens Care & Pain Clin, Valencia, Spain

[5] Hosp 12 Octubre, Inst Invest Imas12, Madrid, Spain

[6] Univ Complutense Madrid, Fac Estudios Estadist, Madrid, Spain

Source:

COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE | 2023 / Volume 242

Keywords:

Machine learning; Missing data; Data imputation; Informative missingness; Electronic health records; COVID-19;

DOI:

10.1016/j.cmpb.2023.107803

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203; 0835;

Abstract:

Background and objective: Reusing Electronic Health Records (EHRs) for Machine Learning (ML) often yields extremely incomplete and sparse tabular datasets, which can hinder model development and limit performance and generalization. In this study, we aimed to characterize the most effective data imputation techniques and ML models for dealing with highly missing numerical data in EHRs, in the case where only a very limited number of data are complete, as opposed to the usual case of having a reduced number of missing values. Methods: We used a case study including full blood count laboratory data, demographic and survival data in the context of COVID-19 hospital admissions and evaluated 30 processing pipelines combining imputation methods with ML classifiers. The imputation methods included missing mask, translation and encoding, mean imputation, k-nearest neighbors imputation, Bayesian ridge regression imputation and generative adversarial imputation networks. The classifiers included k-nearest neighbors, logistic regression, random forest, gradient boosting and deep multilayer perceptron. Results: Our results suggest that in the presence of highly missing data, combining translation and encoding imputation (which considers informative missingness) with tree ensemble classifiers (random forest and gradient boosting) is a sensible choice when aiming to maximize performance, in terms of area under curve. Conclusions: Based on our findings, we recommend the consideration of this imputer-classifier configuration when constructing models in the presence of extremely incomplete numerical data in EHRs.
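
To make the two missingness-aware imputers named in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not define "translation and encoding", so the version here (shift each feature so that its smallest observed value equals a positive offset, then write a constant outside the observed range into the missing cells) is only one plausible reading. The function names, the `fill_value` and `offset` parameters, and the toy values are all hypothetical.

```python
import math


def missing_mask(rows):
    """Return a 0/1 mask with 1 wherever a value is missing (NaN)."""
    return [[1 if math.isnan(v) else 0 for v in row] for row in rows]


def translate_and_encode(rows, fill_value=0.0, offset=1.0):
    """Shift each column so its smallest observed value equals `offset`,
    then write `fill_value` into the missing cells.

    Assumption: this is one plausible reading of "translation and
    encoding"; the paper's exact definition may differ.
    """
    n_cols = len(rows[0])
    encoded = [list(row) for row in rows]
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        # A fully missing column has nothing to shift.
        shift = offset - min(observed) if observed else 0.0
        for row in encoded:
            row[j] = fill_value if math.isnan(row[j]) else row[j] + shift
    return encoded


nan = math.nan
# Hypothetical toy values, not data from the study.
X = [
    [4.5, nan],
    [nan, 210.0],
    [6.1, 180.0],
]
mask = missing_mask(X)
X_encoded = translate_and_encode(X)
```

Because the fill value lies below every shifted observation, a tree-based classifier can separate "missing" from "observed" with a single split, which is one way informative missingness can reach the model. The encoded matrix (optionally concatenated with the mask) could then be passed to a tree ensemble such as scikit-learn's `RandomForestClassifier`.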


Pages: 9

References: 56 in total

