Gene Expression-Based Cancer Classification for Handling the Class Imbalance Problem and Curse of Dimensionality

Cited by: 4
Authors
Al-Azani, Sadam [1]
Alkhnbashi, Omer S. [2]
Ramadan, Emad [2]
Alfarraj, Motaz [1, 2, 3]
Affiliations
[1] King Fahd Univ Petr & Minerals KFUPM, SDAIA KFUPM Joint Res Ctr Artificial Intelligence, Dhahran 31261, Saudi Arabia
[2] King Fahd Univ Petr & Minerals KFUPM, Informat & Comp Sci Dept, Dhahran 31261, Saudi Arabia
[3] King Fahd Univ Petr & Minerals KFUPM, Elect Engn Dept, Dhahran 31261, Saudi Arabia
Keywords
cancer detection and diagnosis; gene expression; feature selection; class imbalance; FEATURE-SELECTION; CLASSIFIERS; PREDICTION; DISCOVERY; DIAGNOSIS; ENSEMBLE; MACHINE; SMOTE; TUMOR;
DOI
10.3390/ijms25042102
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Cancer is a leading cause of death globally. Most cancer cases are diagnosed only at a late stage because conventional diagnostic methods are used, which reduces patients' chances of survival. Early detection, followed by early diagnosis, is therefore an important task in cancer research. Gene expression microarray technology has been applied to detect and diagnose most types of cancer in their early stages, with encouraging results. In this paper, we address gene expression-based cancer classification while handling two common issues in gene expression datasets: class imbalance and the curse of dimensionality. Class imbalance is addressed by oversampling, which adds synthetic samples. The curse of dimensionality is addressed by applying chi-square (CHiS) and information gain (IG) feature selection techniques. After applying these techniques individually, we propose a method that selects the most significant genes by combining the two. We investigate the effect of these techniques individually and in combination. Four benchmark biomedical datasets (Leukemia-subtypes, Leukemia-ALLAML, Colon, and CuMiDa) were used. The experimental results reveal that oversampling improves the results in most cases, and that the proposed feature selection technique outperforms the individual techniques in nearly all cases. This study also provides an empirical evaluation of several oversampling techniques combined with ensemble-based learning. SVM-SMOTE, together with the random forest classifier, achieved the highest results, with a reported accuracy of 100%. These results also surpass those reported in the existing literature.
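The oversampling step mentioned in the abstract can be illustrated with a minimal sketch of the basic SMOTE interpolation idea: each synthetic sample is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is not the authors' implementation and not the SVM-SMOTE variant they report as best; the function name `smote_oversample`, its parameters, and the toy expression values are assumptions made for illustration only.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points built by basic SMOTE-style interpolation.

    Hypothetical helper for illustration; `minority` is a list of
    equal-length feature vectors from the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (made-up values): 3 minority samples described by 2 genes.
minority_class = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0]]
new_samples = smote_oversample(minority_class, n_new=4, k=2, seed=42)
print(len(new_samples))
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already spanned by that class. In practice, oversampling is usually applied to the training split only, so that synthetic samples do not leak into the evaluation data.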
Pages: 17