A Hybrid Machine Learning Approach to Screen Optimal Predictors for the Classification of Primary Breast Tumors from Gene Expression Microarray Data

Cited by: 5
Authors
Alromema, Nashwan [1]
Syed, Asif Hassan [1]
Khan, Tabrej [2]
Affiliations
[1] King Abdulaziz Univ, Fac Comp & Informat Technol Rabigh FCITR, Dept Comp Sci, Jeddah 22254, Saudi Arabia
[2] King Abdulaziz Univ, Fac Comp & Informat Technol Rabigh FCITR, Dept Informat Syst, Jeddah 22254, Saudi Arabia
Keywords
primary breast tumor; gene-biomarkers; hybrid-feature selection approach; filter-based fs; two-tailed unpaired t-test; meta-heuristics techniques; supervised machine learning classifiers; breast tumor prediction; FEATURE-SELECTION ALGORITHM; CANCER; PROTEIN; MAPK; OPTIMIZATION; BIOMARKER; RISK; ENAH;
DOI
10.3390/diagnostics13040708
Chinese Library Classification (CLC) number
R5 [Internal Medicine]
Discipline classification code
1002; 100201
Abstract
The high dimensionality and sparsity of microarray gene expression data make it challenging to analyze the data and screen for the optimal subset of genes as predictors of breast cancer (BC). The authors of the present study propose a novel hybrid, sequential Feature Selection (FS) framework involving minimum Redundancy-Maximum Relevance (mRMR), a two-tailed unpaired t-test, and meta-heuristics to screen for the optimal set of gene biomarkers as predictors of BC. The proposed framework identified a set of three optimal gene biomarkers, namely MAPK1, APOBEC3B, and ENAH. In addition, state-of-the-art supervised Machine Learning (ML) algorithms, namely Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Neural Net (NN), Naive Bayes (NB), Decision Tree (DT), eXtreme Gradient Boosting (XGBoost), and Logistic Regression (LR), were used to test the predictive capability of the selected gene biomarkers and to select the most effective breast cancer diagnostic model, judged by higher values of the performance metrics. The study found that the XGBoost-based model was the best performer, with an accuracy of 0.976 ± 0.027, an F1-score of 0.974 ± 0.030, and an AUC of 0.961 ± 0.035 when tested on an independent test dataset. The classification system based on the screened gene biomarkers efficiently distinguishes primary breast tumors from normal breast samples.
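As a rough illustration of the filter step named in the abstract, the sketch below ranks genes by the absolute value of an unpaired t-statistic computed between tumor and normal samples. It is not the authors' implementation: the Welch (unequal-variance) form of the statistic, the function names `welch_t` and `rank_genes`, the gene labels, and the synthetic data are all assumptions made for illustration, and the p-value, mRMR, and meta-heuristic stages are omitted.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Unpaired t-statistic (Welch form, assumed here) for two samples."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


def rank_genes(tumor, normal):
    """Rank genes by |t|, largest first; inputs map gene name -> values."""
    scores = {g: abs(welch_t(tumor[g], normal[g])) for g in tumor}
    return sorted(scores, key=scores.get, reverse=True)


# Synthetic, hypothetical data: only "gene_a" is shifted between groups.
rng = random.Random(0)
tumor = {
    "gene_a": [rng.gauss(5.0, 1.0) for _ in range(30)],
    "gene_b": [rng.gauss(0.0, 1.0) for _ in range(30)],
}
normal = {
    "gene_a": [rng.gauss(0.0, 1.0) for _ in range(30)],
    "gene_b": [rng.gauss(0.0, 1.0) for _ in range(30)],
}
ranking = rank_genes(tumor, normal)
```

In a fuller pipeline, a two-tailed p-value would be derived from the statistic and thresholded, and the surviving genes would be passed to the later selection stages and classifiers.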
Pages: 31