Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis

Cited by: 3

Authors:

Gurcan, Fatih [1]

Soylu, Ahmet [2]

Affiliations:

[1] Karadeniz Tech Univ, Fac Econ & Adm Sci, Dept Management Informat Syst, TR-61080 Trabzon, Turkiye

[2] Norwegian Univ Sci & Technol, Fac Informat Technol & Elect Engn, Dept Comp Sci, N-2815 Gjovik, Norway

Source:

CANCERS | 2024 / Vol. 16 / Issue 19

Keywords:

cancer diagnosis and prognosis; class imbalance; machine learning; resampling techniques; random forest; predictive modeling; MULTICLASS;

DOI:

10.3390/cancers16193417

Chinese Library Classification:

R73 [Oncology];

Discipline classification code:

100214;

Abstract:

Simple Summary: This research focuses on improving cancer diagnosis and prognosis by addressing a common problem in data analysis known as class imbalance, where some patient groups are underrepresented. The authors aim to evaluate different resampling methods that can balance the data and enhance the performance of various classification algorithms used to predict cancer outcomes. By testing a wide range of techniques across multiple cancer datasets, this study identifies the best-performing classifier, Random Forest, along with the most effective resampling method, SMOTEENN. These findings provide valuable insights for researchers and healthcare professionals, enabling them to make more accurate predictions and ultimately improve patient care. This research could pave the way for the development of more reliable machine learning applications in the medical field.

Abstract

Background/Objectives: This study aims to evaluate the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenges of class imbalance.

Methods: A total of five datasets were analyzed, including three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (Seer Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories were employed, and ten classifiers from four distinct categories were utilized for comparison.

Results: The results demonstrated that hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). In terms of classifiers, Random Forest showed the best performance with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline method (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes.

Conclusions: This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets, providing valuable insights for researchers and healthcare professionals. The findings serve as a foundation for future studies aimed at integrating machine learning techniques in cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.
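The abstract names SMOTEENN as the strongest resampling method but does not describe how it works. As a rough illustration only, the sketch below combines the two ideas the name refers to: SMOTE (oversampling the minority class by interpolating between neighbouring minority samples) followed by Edited Nearest Neighbours (dropping samples whose label disagrees with their neighbours). It is a simplified, standard-library version written for this note, not the authors' pipeline; the function names, the toy data, `k=3`, and the majority-vote cleaning rule are all assumptions. In practice a maintained implementation such as `imblearn.combine.SMOTEENN` from imbalanced-learn would normally be used instead.

```python
import math
import random
from collections import Counter


def nearest_indices(point, points, k, skip=None):
    """Indices of the k points closest to `point` (Euclidean), optionally skipping one index."""
    candidates = [i for i in range(len(points)) if i != skip]
    return sorted(candidates, key=lambda i: math.dist(point, points[i]))[:k]


def smote(X, y, minority, k=3, seed=0):
    """Add synthetic minority samples until the minority matches the largest class."""
    rng = random.Random(seed)
    pts = [x for x, label in zip(X, y) if label == minority]
    if len(pts) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    n_needed = max(Counter(y).values()) - len(pts)
    X_out, y_out = list(X), list(y)
    for _ in range(n_needed):
        b = rng.randrange(len(pts))                       # base minority sample
        o = rng.choice(nearest_indices(pts[b], pts, k, skip=b))  # one of its neighbours
        gap = rng.random()                                # position along the segment
        X_out.append(tuple(p + gap * (q - p) for p, q in zip(pts[b], pts[o])))
        y_out.append(minority)
    return X_out, y_out


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority vote of their k nearest neighbours."""
    kept = []
    for i, (x, label) in enumerate(zip(X, y)):
        votes = Counter(y[j] for j in nearest_indices(x, X, k, skip=i))
        if votes.most_common(1)[0][0] == label:
            kept.append(i)
    return [X[i] for i in kept], [y[i] for i in kept]


def smote_enn(X, y, minority, k=3, seed=0):
    """Hybrid resampling: oversample with SMOTE, then clean with ENN."""
    X_os, y_os = smote(X, y, minority, k, seed)
    return edited_nearest_neighbours(X_os, y_os, k)


# Toy imbalanced data (hypothetical): 40 majority samples, 8 minority samples.
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
X += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(8)]
y = [0] * 40 + [1] * 8

X_res, y_res = smote_enn(X, y, minority=1)
print("before:", Counter(y))
print("after: ", Counter(y_res))
```

Resampling of this kind is normally applied to the training split only, so that synthetic samples do not leak into the data used for evaluation.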

Pages: 19
