Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis

Cited by: 3
Authors
Gurcan, Fatih [1]
Soylu, Ahmet [2]
Affiliations
[1] Karadeniz Tech Univ, Fac Econ & Adm Sci, Dept Management Informat Syst, TR-61080 Trabzon, Turkiye
[2] Norwegian Univ Sci & Technol, Fac Informat Technol & Elect Engn, Dept Comp Sci, N-2815 Gjovik, Norway
Keywords
cancer diagnosis and prognosis; class imbalance; machine learning; resampling techniques; random forest; predictive modeling; multiclass
DOI
10.3390/cancers16193417
Chinese Library Classification number
R73 [Oncology]
Discipline classification code
100214
Abstract
Simple Summary: This research focuses on improving cancer diagnosis and prognosis by addressing a common problem in data analysis known as class imbalance, where some patient groups are underrepresented. The authors aim to evaluate different resampling methods that can balance the data and enhance the performance of various classification algorithms used to predict cancer outcomes. By testing a wide range of techniques across multiple cancer datasets, this study identifies the best-performing classifier, Random Forest, along with the most effective resampling method, SMOTEENN. These findings provide valuable insights for researchers and healthcare professionals, enabling them to make more accurate predictions and ultimately improve patient care. This research could pave the way for the development of more reliable machine learning applications in the medical field.

Abstract: Background/Objectives: This study aims to evaluate the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenges of class imbalance.

Methods: A total of five datasets were analyzed, including three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (Seer Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories were employed, and ten classifiers from four distinct categories were utilized for comparison.

Results: The results demonstrated that hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). In terms of classifiers, Random Forest showed the best performance with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline method (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes.

Conclusions: This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets, providing valuable insights for researchers and healthcare professionals. The findings serve as a foundation for future studies aimed at integrating machine learning techniques in cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.
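To make the idea of resampling concrete, the following is a minimal sketch of the simplest member of the family, random oversampling, written with the Python standard library only. It is not the authors' pipeline and it is not SMOTEENN (which combines synthetic minority oversampling with edited-nearest-neighbour cleaning and is available in the imbalanced-learn package); the function name `random_oversample` and the toy data are hypothetical and chosen purely for illustration.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many samples as the largest class.

    Illustrative stand-in only: the study's best method, SMOTEENN,
    synthesises new points and removes noisy ones instead.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())

    # Start from a copy of the original data, then append duplicates.
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 8 majority-class (0) and 2 minority-class (1) samples.
X = [[float(i), float(i % 3)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y)
print(Counter(y), "->", Counter(y_bal))
```

In practice, resampling of this kind is applied to the training split only, so that the evaluation data keeps its original class distribution.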
Pages: 19