Multiple Imputation for Robust Cluster Analysis to Address Missingness in Medical Data

Cited by: 1
Authors
Harder, Arnold A. [1]
Olbricht, Gayla R. [1, 2]
Ekuma, Godwin [3]
Hier, Daniel B. [2]
Obafemi-Ajayi, Tayo [2, 4]
Affiliations
[1] Missouri Univ Sci & Technol, Dept Math & Stat, Rolla, MO 65409 USA
[2] Missouri Univ Sci & Technol, Dept Elect & Comp Engn, Appl Computat Intelligence Lab, Rolla, MO 65409 USA
[3] Missouri State Univ, Dept Comp Sci, Springfield, MO 65897 USA
[4] Missouri State Univ, Engn Program, Springfield, MO 65897 USA
Keywords
Multiple data imputation; clustering; ensemble learning; canonical discriminant analysis; mixture models; traumatic brain injury; missingness; INFERENCE; MODELS; MICE
DOI
10.1109/ACCESS.2024.3377242
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Cluster analysis has been applied to a wide range of problems as an exploratory tool to enhance knowledge discovery. In medical data, clustering aids disease subtyping, i.e., identifying homogeneous patient subgroups. Missing data is a common problem in medical research and can bias clustering results if not properly handled. Yet multiple imputation has been under-utilized to address missingness when clustering medical data. Its limited use, despite its known advantages, could be attributed to many factors. These include methodological complexity, difficulties in pooling results to obtain a consensus clustering, uncertainty regarding quality metrics, and a lack of accepted pipelines. A few studies have examined the feasibility of implementing multiple imputation for cluster analysis on simulated or small datasets. While these studies have begun to address how to pool imputed values and quantify the uncertainty in clustering due to imputation, a need remains for a complete framework that integrates multiple imputation into the clustering of complex medical data with sophisticated clustering algorithms. We propose a cluster analysis framework that mitigates bias and addresses these limitations. It includes methods to pool multiple imputed datasets, create a consensus cluster solution by ensemble methods, and select an optimal number of clusters based on validity indices. It also estimates the uncertainty in cluster membership attributable to the imputation and identifies the features that characterize the derived clusters. The utility of this framework is illustrated by its application to a traumatic brain injury dataset with missing data. Our analysis revealed six multifaceted clusters that differed with respect to Glasgow Coma Score (GCS), mechanism of injury, sociodemographics, vitals, lab values, and radiological presentation. The most severe cluster consisted of single, relatively young patients injured by motor accident, with higher GCS severity scores. Comparative analysis with the miclust R package, along with statistical validation of the cluster characterization, demonstrates the framework's robust performance.
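The idea of pooling cluster solutions across imputed datasets can be illustrated with a small sketch. This is not the authors' pipeline: it assumes that each imputed dataset has already been clustered separately, and it swaps in a simple co-association (evidence accumulation) ensemble with a hypothetical 0.5 linking threshold. The function names and the toy labelings are illustrative only.

```python
def co_association(labelings):
    """Fraction of imputed datasets in which each pair of items shares a cluster."""
    m = len(labelings)
    n = len(labelings[0])
    co = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            together = sum(1 for labels in labelings if labels[i] == labels[j])
            co[i][j] = together / m
    return co


def consensus_labels(co, threshold=0.5):
    """Link items whose co-association exceeds the threshold (an assumed cutoff)
    and return the connected components as consensus clusters."""
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def membership_stability(co, labels):
    """Mean co-association of each item with the other members of its consensus
    cluster; values below 1 indicate disagreement across imputations."""
    stability = []
    for i, li in enumerate(labels):
        peers = [co[i][j] for j, lj in enumerate(labels) if lj == li and j != i]
        stability.append(sum(peers) / len(peers) if peers else 1.0)
    return stability


# Toy input: cluster labels for 5 patients from 3 imputed datasets.
# Label values need not match across imputations; only co-membership matters.
labelings = [
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
]
co = co_association(labelings)
consensus = consensus_labels(co)
stability = membership_stability(co, consensus)
print(consensus)  # [0, 0, 0, 1, 1]
print(stability)
```

In this toy case the third patient switches clusters in one of the three imputations, so its stability is lower than that of the other patients. A per-patient score of this kind is one simple way to express membership uncertainty attributable to imputation.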
Pages: 42974-42991
Number of pages: 18