Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis

Cited by: 3
Authors
Gurcan, Fatih [1]
Soylu, Ahmet [2]
Affiliations
[1] Karadeniz Tech Univ, Fac Econ & Adm Sci, Dept Management Informat Syst, TR-61080 Trabzon, Turkiye
[2] Norwegian Univ Sci & Technol, Fac Informat Technol & Elect Engn, Dept Comp Sci, N-2815 Gjovik, Norway
Keywords
cancer diagnosis and prognosis; class imbalance; machine learning; resampling techniques; random forest; predictive modeling; multiclass
DOI
10.3390/cancers16193417
Chinese Library Classification number
R73 [Oncology]
Subject classification code
100214
Abstract
Simple Summary: This research focuses on improving cancer diagnosis and prognosis by addressing a common problem in data analysis known as class imbalance, where some patient groups are underrepresented. The authors aim to evaluate different resampling methods that can balance the data and enhance the performance of various classification algorithms used to predict cancer outcomes. By testing a wide range of techniques across multiple cancer datasets, this study identifies the best-performing classifier, Random Forest, along with the most effective resampling method, SMOTEENN. These findings provide valuable insights for researchers and healthcare professionals, enabling them to make more accurate predictions and ultimately improve patient care. This research could pave the way for the development of more reliable machine learning applications in the medical field.

Abstract:
Background/Objectives: This study aims to evaluate the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenges of class imbalance.
Methods: A total of five datasets were analyzed, including three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (Seer Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories were employed, and ten classifiers from four distinct categories were utilized for comparison.
Results: The results demonstrated that hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). In terms of classifiers, Random Forest showed the best performance with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline method (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes.
Conclusions: This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets, providing valuable insights for researchers and healthcare professionals. The findings serve as a foundation for future studies aimed at integrating machine learning techniques in cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.
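The hybrid resampling idea behind SMOTEENN can be sketched in plain Python: oversample the minority class by interpolating between minority neighbours (SMOTE), then drop any sample whose label disagrees with the majority of its nearest neighbours (Edited Nearest Neighbours). This is an illustrative sketch only, not the authors' pipeline. The abstract does not say which implementation was used, and the toy two-class data, the function names, the choice of k = 3, and the majority-vote editing rule are all assumptions made here.

```python
import math
import random
from collections import Counter


def k_nearest(point, pool, k):
    """Indices of the (up to) k points in `pool` closest to `point` (Euclidean)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]


def smote(X, y, minority, k=3, rng=None):
    """Add synthetic minority points until the minority matches the largest class."""
    rng = rng or random.Random(0)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = max(Counter(y).values()) - len(minority_pts)
    X_new, y_new = list(X), list(y)
    for _ in range(n_needed):
        base_idx = rng.randrange(len(minority_pts))
        base = minority_pts[base_idx]
        others = [p for j, p in enumerate(minority_pts) if j != base_idx]
        neighbour = others[rng.choice(k_nearest(base, others, k))]
        gap = rng.random()  # position along the segment base -> neighbour
        X_new.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
        y_new.append(minority)
    return X_new, y_new


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority vote of their k neighbours."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        others_X = X[:i] + X[i + 1:]
        others_y = y[:i] + y[i + 1:]
        votes = Counter(others_y[j] for j in k_nearest(x, others_X, k))
        if votes.most_common(1)[0][0] == label:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smoteenn(X, y, minority, k=3, seed=0):
    """Hybrid resampling: SMOTE oversampling followed by ENN cleaning."""
    X_over, y_over = smote(X, y, minority, k, random.Random(seed))
    return edited_nearest_neighbours(X_over, y_over, k)


# Hypothetical toy data: 90 majority samples and 10 minority samples in 2-D.
data_rng = random.Random(42)
X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(90)]
X += [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(10)]
y = [0] * 90 + [1] * 10

X_res, y_res = smoteenn(X, y, minority=1)
print("before:", dict(Counter(y)))
print("after: ", dict(Counter(y_res)))
```

In practice the resampled `X_res`, `y_res` would be used only to fit a classifier such as Random Forest, with the test split left untouched, so that reported performance reflects the original class distribution.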
Pages: 19