A hybrid stacking classifier with feature selection for handling imbalanced data

Cited by: 0
Authors
Abraham A. [1 ]
Kayalvizhi R. [1 ]
Mohideen H.S. [2 ]
Affiliations
[1] Department of Networking and Communications, School of Computing, SRM Institute of Science and Technology, Kattankulathur, Chennai
[2] Department of Genetic Engineering, College of Engineering and Technology, SRM Institute of Science and Technology, Kattankulathur, Chennai
Source
Journal of Intelligent & Fuzzy Systems (IOS Press)
Keywords
Machine learning; multi-classification; ovarian cancer; Pickle; Random Forest
DOI
10.3233/JIFS-236197
Abstract
Cancer has become increasingly alarming. This paper addresses Epithelial Ovarian Cancer (EOC), the most significant form of ovarian cancer owing to its low survival rate. The proposed 'Multi-classifier ShapRFECV based EOC' (MSRFECV-EOC) subtype-analysis technique uses EOC data from the National Center for Biotechnology Information and the Cancer Cell Line Encyclopedia for early identification of EOC with machine-learning techniques. The approach enlarges the dataset, balances its classes, and removes the large number of features unrelated to the disease of interest to prevent overfitting. In the data-preprocessing stage, OC-related gene names were collected from the CancerMine database and other OC studies, the OC datasets were merged on these genes, and missing values for the EOC subtypes were identified and imputed with Iterative Logistic Imputation. The Synthetic Minority Oversampling Technique with Edited Nearest Neighbors (SMOTE-ENN) was then applied to the imputed dataset. In the feature-selection phase, the most significant features for the EOC subtypes were identified with the Shapley Additive Explanations based Recursive Feature Elimination with Cross-Validation (ShapRFECV) algorithm, which preserves predefined features while selecting new EOC features. An accuracy of 97% was achieved with an Optuna-optimized Random Forest, outperforming existing models, and SHAP plots highlighted the most prominent features behind the classification. The Pickle tool saves considerable training time by preserving the trained model's parameters. In the final phase, a Stratified K-Fold Stacking Classifier raised the accuracy to 98.9%. © 2024 – IOS Press. All rights reserved.
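The abstract walks through a multi-stage pipeline: iterative imputation, SMOTE-ENN class balancing, ShapRFECV feature selection, an Optuna-tuned Random Forest, pickling of the trained model, and a Stratified K-Fold stacking classifier. The sketch below illustrates that flow in Python under stated assumptions: it relies on scikit-learn, imbalanced-learn, and Optuna; scikit-learn's importance-based RFECV stands in for the SHAP-based ShapRFECV step (the probatus package offers a dedicated ShapRFECV implementation); and the function name msrfecv_eoc_sketch, the feature count, the base learners, and the hyperparameter ranges are illustrative placeholders, not the authors' configuration.

```python
# Minimal sketch of the MSRFECV-EOC workflow summarized in the abstract.
# Library choices, parameter ranges, and the stacking base learners are
# assumptions for illustration only.
import pickle

import optuna
import pandas as pd
from imblearn.combine import SMOTEENN
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.feature_selection import RFECV
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def msrfecv_eoc_sketch(X: pd.DataFrame, y, random_state=42):
    """X: samples x OC-related gene features; y: EOC subtype labels."""
    # 1) Impute missing values iteratively (stand-in for the paper's
    #    "Iterative Logistic Imputation").
    X = pd.DataFrame(IterativeImputer(random_state=random_state).fit_transform(X),
                     columns=X.columns)

    # 2) Balance the subtype classes with SMOTE + Edited Nearest Neighbours.
    X, y = SMOTEENN(random_state=random_state).fit_resample(X, y)

    # 3) Recursive feature elimination with cross-validation. This uses
    #    impurity importances; the paper's ShapRFECV ranks features by
    #    SHAP values instead.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    selector = RFECV(RandomForestClassifier(random_state=random_state),
                     step=0.2, cv=cv, scoring="accuracy", min_features_to_select=10)
    selector.fit(X, y)
    X = X.loc[:, selector.get_support()]

    # 4) Tune a Random Forest with Optuna (maximize cross-validated accuracy).
    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 600),
            "max_depth": trial.suggest_int("max_depth", 3, 20),
            "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
        }
        model = RandomForestClassifier(random_state=random_state, **params)
        return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    best_rf = RandomForestClassifier(random_state=random_state, **study.best_params)

    # 5) Stratified K-Fold stacking on top of the tuned forest
    #    (second base learner and meta-learner are assumptions).
    stack = StackingClassifier(
        estimators=[("rf", best_rf), ("lr", LogisticRegression(max_iter=2000))],
        final_estimator=LogisticRegression(max_iter=2000),
        cv=cv,
    )
    stack.fit(X, y)

    # 6) Persist the fitted model with pickle so it can be reused without retraining.
    with open("msrfecv_eoc_stack.pkl", "wb") as fh:
        pickle.dump(stack, fh)
    return stack
```

A caller would pass X as a DataFrame of samples by OC-related genes and y as the EOC subtype labels, e.g. model = msrfecv_eoc_sketch(X, y); the pickled file can later be reloaded with pickle.load to skip retraining, which is the time saving the abstract attributes to Pickle.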
Pages: 9103 - 9117
Page count: 14
Related Papers
50 records in total (entries [21] to [30] shown)
  • [21] A Classification Method Based on Feature Selection for Imbalanced Data
    Liu, Yi
    Wang, Yanzhen
    Ren, Xiaoguang
    Zhou, Hao
    Diao, Xingchun
    IEEE ACCESS, 2019, 7 : 81794 - 81807
  • [22] Imbalanced Data Classification Based on Feature Selection Techniques
    Ksieniewicz, Pawel
    Wozniak, Michal
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING (IDEAL 2018), PT II, 2018, 11315 : 296 - 303
  • [23] Feature Selection with High-Dimensional Imbalanced Data
    Van Hulse, Jason
    Khoshgoftaar, Taghi M.
    Napolitano, Amri
    Wald, Randall
    2009 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2009), 2009, : 507 - 514
  • [24] Hybrid Undersampling and Oversampling for Handling Imbalanced Credit Card Data
    Alamri, Maram
    Ykhlef, Mourad
    IEEE ACCESS, 2024, 12 : 14050 - 14060
  • [25] Classifier Selection for Highly Imbalanced Data Streams with Minority Driven Ensemble
    Zyblewski, Pawel
    Ksieniewicz, Pawel
    Wozniak, Michal
    ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING, PT I, 2019, 11508 : 626 - 635
  • [26] A Proposed Framework on Hybrid Feature Selection Techniques for Handling High Dimensional Educational Data
    Shahiri, Amirah Mohamed
    Husain, Wahidah
    Abd Rashid, Nur'Aini
    2ND INTERNATIONAL CONFERENCE ON APPLIED SCIENCE AND TECHNOLOGY 2017 (ICAST'17), 2017, 1891
  • [27] MSFSS: A whale optimization-based multiple sampling feature selection stacking ensemble algorithm for classifying imbalanced data
    Wang, Shuxiang
    Shao, Changbin
    Xu, Sen
    Yang, Xibei
    Yu, Hualong
    AIMS MATHEMATICS, 2024, 9 (07): 17504 - 17530
  • [28] Undersampling Instance Selection for Hybrid and Incomplete Imbalanced Data
    Camacho-Nieto, Oscar
    Yanez-Marquez, Cornelio
    Villuendas-Rey, Yenny
    JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2020, 26 (06) : 698 - 719
  • [29] Feature selection via minimizing global redundancy for imbalanced data
    Huang, Shuhao
    Chen, Hongmei
    Li, Tianrui
    Chen, Hao
    Luo, Chuan
    APPLIED INTELLIGENCE, 2022, 52 (08) : 8685 - 8707
  • [30] A feature selection method to handle imbalanced data in text classification
    Chang, Fengxiang
    Guo, Jun
    Xu, Weiran
    Yao, Kejun
    Journal of Digital Information Management, 2015, 13 (03): 169 - 175