Unsupervised machine learning for disease prediction: a comparative performance analysis using multiple datasets

Cited by: 4
Authors
Lu, Haohui [1]
Uddin, Shahadat [1]
Affiliations
[1] Univ Sydney, Fac Engn, Sch Project Management, Level 2,21 Ross St, Forest Lodge, NSW 2037, Australia
Keywords
Disease prediction; Performance comparison; Unsupervised machine learning; Healthcare dataset; Diagnosis
DOI
10.1007/s12553-023-00805-8
Chinese Library Classification (CLC) number
R-058
Abstract
Purpose: Disease risk prediction poses a significant and growing challenge in the medical field. While researchers have increasingly utilised machine learning (ML) algorithms to tackle this issue, supervised ML methods remain dominant. However, there is rising interest in unsupervised techniques, especially in situations where data labels might be missing, as with undiagnosed or rare diseases. This study compares unsupervised ML models for disease prediction.

Methods: This study evaluated the efficacy of seven unsupervised algorithms on 15 datasets, including those of heart failure, diabetes, and breast cancer. It used six performance metrics for this comparison: Adjusted Rand Index, Adjusted Mutual Information, Homogeneity, Completeness, V-measure and Silhouette Coefficient.

Results: Among the seven unsupervised ML methods, DBSCAN (density-based spatial clustering of applications with noise) achieved the best performance most often (31 times), followed by the Bayesian Gaussian Mixture (18) and Divisive clustering (15). No single model consistently outperformed the others across every dataset and metric. The study emphasises the crucial role of selecting models and performance measures according to application-specific needs. For example, DBSCAN excels in the Homogeneity, Completeness and V-measure metrics, whereas the Bayesian Gaussian Mixture performs well on the Adjusted Rand Index. The code used in this study can be found at https://github.com/haohuilu/unsupervisedml/.

Conclusion: This research contributes deeper insights into unsupervised ML applications in healthcare and encourages further investigation into model selection. Subsequent studies could harness genuine disease records for a more nuanced comparison and evaluation of models.
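
As an illustration of three of the metrics named in the abstract, the sketch below computes Homogeneity, Completeness and V-measure from their standard entropy-based definitions. It is not the authors' code (that is in the repository linked above). The function names and the toy label lists are hypothetical, chosen only for this example. scikit-learn offers an equivalent routine, `sklearn.metrics.homogeneity_completeness_v_measure`, which would normally be preferred in practice.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equal-length label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity_completeness_v(classes, clusters):
    """Return (homogeneity, completeness, V-measure).

    classes  : ground-truth labels (e.g. diagnosed / not diagnosed)
    clusters : cluster assignments produced by an unsupervised model
    """
    h_classes = entropy(classes)
    h_clusters = entropy(clusters)
    # Homogeneity: each cluster should contain members of a single class.
    homogeneity = (
        1.0 if h_classes == 0
        else 1.0 - conditional_entropy(classes, clusters) / h_classes
    )
    # Completeness: each class should fall into a single cluster.
    completeness = (
        1.0 if h_clusters == 0
        else 1.0 - conditional_entropy(clusters, classes) / h_clusters
    )
    total = homogeneity + completeness
    # V-measure: harmonic mean of homogeneity and completeness.
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure


# Hypothetical toy data: four patients, two true classes, three clusters.
true_classes = [0, 0, 1, 1]
found_clusters = [0, 0, 1, 2]
h, c, v = homogeneity_completeness_v(true_classes, found_clusters)
print(f"homogeneity={h:.3f} completeness={c:.3f} v_measure={v:.3f}")
```

In this toy case every cluster is pure, so homogeneity is 1, while the second class is split across two clusters, which lowers completeness. This is one way a method can score well on one metric and less well on another.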
Pages: 141-154
Page count: 14