A hybrid stacking classifier with feature selection for handling imbalanced data

Cited by: 0
Authors
Abraham A. [1 ]
Kayalvizhi R. [1 ]
Mohideen H.S. [2 ]
Affiliations
[1] Department of Networking and Communications, School of Computing, SRM Institute of Science and Technology, Kattankulathur, Chennai
[2] Department of Genetic Engineering, College of Engineering and Technology, SRM Institute of Science and Technology, Kattankulathur, Chennai
Source
Journal of Intelligent & Fuzzy Systems (IOS Press)
Keywords
Machine learning; multi-classification; ovarian cancer; Pickle; Random Forest
DOI
10.3233/JIFS-236197
Abstract
Cancer has become increasingly alarming. This paper addresses Epithelial Ovarian Cancer (EOC), the most significant form of ovarian cancer owing to its low survival rate. The proposed 'Multi-classifier ShapRFECV based EOC' (MSRFECV-EOC) subtype-analysis technique uses EOC data from the National Center for Biotechnology Information and the Cancer Cell Line Encyclopedia for early identification of EOC with machine-learning techniques. The approach enlarges the dataset, balances its classes, and removes the large number of features unrelated to the disease of interest to prevent overfitting. In the data-preprocessing stage, OC-related gene names were collected from the CancerMine database and other OC studies, the OC datasets were merged on these genes, and missing values for the EOC subtypes were identified and imputed with Iterative Logistic Imputation. The Synthetic Minority Oversampling Technique with Edited Nearest Neighbors (SMOTE-ENN) was then applied to the imputed dataset. In the feature-selection phase, the most significant features for the EOC subtypes were identified with the Shapley Additive Explanations based Recursive Feature Elimination with Cross-Validation (ShapRFECV) algorithm, which preserves predefined features while selecting new EOC features. An accuracy of 97% was achieved with an Optuna-optimized Random Forest, outperforming existing models, and SHAP plots highlighted the most prominent features behind the classification. The Pickle tool saves considerable training time by preserving the trained model's parameters. In the final phase, a Stratified K-Fold Stacking Classifier raised the accuracy to 98.9%. © 2024 – IOS Press. All rights reserved.
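The abstract walks through a multi-stage pipeline: iterative imputation, SMOTE-ENN class balancing, ShapRFECV feature selection, an Optuna-tuned Random Forest, pickling of the trained model, and a Stratified K-Fold stacking classifier. The sketch below illustrates that flow in Python under stated assumptions: it relies on scikit-learn, imbalanced-learn, and Optuna; scikit-learn's importance-based RFECV stands in for the SHAP-based ShapRFECV step (the probatus package offers a dedicated ShapRFECV implementation); and the function name msrfecv_eoc_sketch, the feature count, the base learners, and the hyperparameter ranges are illustrative placeholders, not the authors' configuration.

```python
# Minimal sketch of the MSRFECV-EOC workflow summarized in the abstract.
# Library choices, parameter ranges, and the stacking base learners are
# assumptions for illustration only.
import pickle

import optuna
import pandas as pd
from imblearn.combine import SMOTEENN
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.feature_selection import RFECV
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def msrfecv_eoc_sketch(X: pd.DataFrame, y, random_state=42):
    """X: samples x OC-related gene features; y: EOC subtype labels."""
    # 1) Impute missing values iteratively (stand-in for the paper's
    #    "Iterative Logistic Imputation").
    X = pd.DataFrame(IterativeImputer(random_state=random_state).fit_transform(X),
                     columns=X.columns)

    # 2) Balance the subtype classes with SMOTE + Edited Nearest Neighbours.
    X, y = SMOTEENN(random_state=random_state).fit_resample(X, y)

    # 3) Recursive feature elimination with cross-validation. This uses
    #    impurity importances; the paper's ShapRFECV ranks features by
    #    SHAP values instead.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    selector = RFECV(RandomForestClassifier(random_state=random_state),
                     step=0.2, cv=cv, scoring="accuracy", min_features_to_select=10)
    selector.fit(X, y)
    X = X.loc[:, selector.get_support()]

    # 4) Tune a Random Forest with Optuna (maximize cross-validated accuracy).
    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 600),
            "max_depth": trial.suggest_int("max_depth", 3, 20),
            "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
        }
        model = RandomForestClassifier(random_state=random_state, **params)
        return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    best_rf = RandomForestClassifier(random_state=random_state, **study.best_params)

    # 5) Stratified K-Fold stacking on top of the tuned forest
    #    (second base learner and meta-learner are assumptions).
    stack = StackingClassifier(
        estimators=[("rf", best_rf), ("lr", LogisticRegression(max_iter=2000))],
        final_estimator=LogisticRegression(max_iter=2000),
        cv=cv,
    )
    stack.fit(X, y)

    # 6) Persist the fitted model with pickle so it can be reused without retraining.
    with open("msrfecv_eoc_stack.pkl", "wb") as fh:
        pickle.dump(stack, fh)
    return stack
```

A caller would pass X as a DataFrame of samples by OC-related genes and y as the EOC subtype labels, e.g. model = msrfecv_eoc_sketch(X, y); the pickled file can later be reloaded with pickle.load to skip retraining, which is the time saving the abstract attributes to Pickle.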
Pages: 9103 - 9117
Page count: 14
Related Papers
50 records in total (entries [21] to [30] shown)
  • [21] A Classification Method Based on Feature Selection for Imbalanced Data
    Liu, Yi
    Wang, Yanzhen
    Ren, Xiaoguang
    Zhou, Hao
    Diao, Xingchun
    IEEE ACCESS, 2019, 7 : 81794 - 81807
  • [22] Imbalanced Data Classification Based on Feature Selection Techniques
    Ksieniewicz, Pawel
    Wozniak, Michal
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING (IDEAL 2018), PT II, 2018, 11315 : 296 - 303
  • [23] Feature Selection with High-Dimensional Imbalanced Data
    Van Hulse, Jason
    Khoshgoftaar, Taghi M.
    Napolitano, Amri
    Wald, Randall
    2009 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2009), 2009, : 507 - 514
  • [24] Hybrid Undersampling and Oversampling for Handling Imbalanced Credit Card Data
    Alamri, Maram
    Ykhlef, Mourad
    IEEE ACCESS, 2024, 12 : 14050 - 14060
  • [25] Classifier Selection for Highly Imbalanced Data Streams with Minority Driven Ensemble
    Zyblewski, Pawel
    Ksieniewicz, Pawel
    Wozniak, Michal
    ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING, PT I, 2019, 11508 : 626 - 635
  • [26] A Proposed Framework on Hybrid Feature Selection Techniques for Handling High Dimensional Educational Data
    Shahiri, Amirah Mohamed
    Husain, Wahidah
    Abd Rashid, Nur'Aini
    2ND INTERNATIONAL CONFERENCE ON APPLIED SCIENCE AND TECHNOLOGY 2017 (ICAST'17), 2017, 1891
  • [27] MSFSS: A whale optimization-based multiple sampling feature selection stacking ensemble algorithm for classifying imbalanced data
    Wang, Shuxiang
    Shao, Changbin
    Xu, Sen
    Yang, Xibei
    Yu, Hualong
    AIMS MATHEMATICS, 2024, 9 (07): 17504 - 17530
  • [28] Undersampling Instance Selection for Hybrid and Incomplete Imbalanced Data
    Camacho-Nieto, Oscar
    Yanez-Marquez, Cornelio
    Villuendas-Rey, Yenny
    JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2020, 26 (06) : 698 - 719
  • [29] Feature selection via minimizing global redundancy for imbalanced data
    Huang, Shuhao
    Chen, Hongmei
    Li, Tianrui
    Chen, Hao
    Luo, Chuan
    APPLIED INTELLIGENCE, 2022, 52 (08) : 8685 - 8707
  • [30] A feature selection method to handle imbalanced data in text classification
    Chang, Fengxiang
    Guo, Jun
    Xu, Weiran
    Yao, Kejun
    Journal of Digital Information Management, 2015, 13 (03): 169 - 175