Identification of Secondary Breast Cancer in Vital Organs through the Integration of Machine Learning and Microarrays

Cited by: 0
Authors
Riaz, Faisal [1]
Abid, Fazeel [1]
Din, Ikram Ud [2]
Kim, Byung-Seo [3]
Almogren, Ahmad [4]
Ul Durar, Shajara [5]
Affiliations
[1] Univ Management & Technol, Dept Informat Syst, Lahore 54770, Pakistan
[2] Univ Haripur, Dept Informat Technol, Haripur 22620, Pakistan
[3] Hongik Univ, Dept Software & Commun Engn, Sejong 30016, South Korea
[4] King Saud Univ, Dept Comp Sci, Coll Comp & Informat Sci, Riyadh 11633, Saudi Arabia
[5] Univ Creat Arts, Management & Org Behav Business Sch, Epsom KT18 5BE, Surrey, England
Funding
National Research Foundation, Singapore;
Keywords
metastasis; microarray; gene expression omnibus; decision trees; random forest; K-nearest neighbours; support vector machine; K-means SMOTE; GENE-EXPRESSION; METASTASES;
DOI
10.3390/electronics11121879
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Breast cancer is the most prevalent malignancy in women, with genetic and environmental factors contributing to its pathogenesis and progression. Its metastasis to the bones, liver, brain, and lungs is the main cause of death in patients. Feature selection and classification are central to microarray data analysis, but they are highly time-consuming. To address these issues, this research integrates machine learning and microarrays to identify secondary breast cancer in vital organs. The work first imputes missing values using K-nearest neighbors and improves recursive feature elimination with cross-validation (RFECV) using the random forest method. Second, class imbalance is handled with the K-means synthetic minority oversampling technique (K-means SMOTE), which balances the minority class and prevents noise. The 16 most essential Entrez gene IDs for predicting metastatic locations in the bones, brain, liver, and lungs are identified. Extensive experiments are conducted on the NCBI Gene Expression Omnibus GSE14020 and GSE54323 datasets. The proposed methods handle class imbalance, prevent noise, and reduce time consumption. Reliable results were obtained with four classification models: decision tree, K-nearest neighbors, random forest, and support vector machine. Results are reported using confusion matrices, accuracy, ROC-AUC, PR-AUC, and F1-score.
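Two of the preprocessing steps named in the abstract, K-nearest-neighbor imputation of missing values and SMOTE-style oversampling of a minority class, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the toy expression matrix, the labels, and the function names `nan_euclidean`, `knn_impute`, and `smote_like` are hypothetical, and the sketch uses plain SMOTE interpolation, omitting both the K-means clustering step of K-means SMOTE and the RFECV feature selection.

```python
import math
import random


def nan_euclidean(a, b):
    """Euclidean distance over features observed in both rows,
    rescaled to the full feature count."""
    pairs = [(x, y) for x, y in zip(a, b)
             if not (math.isnan(x) or math.isnan(y))]
    if not pairs:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in pairs)
    return math.sqrt(squared * len(a) / len(pairs))


def knn_impute(rows, k=2):
    """Replace each missing value with the mean of that feature over
    the k nearest rows in which the feature is observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            donors = sorted(
                (nan_euclidean(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and not math.isnan(other[j])
            )[:k]
            if donors:  # leave NaN in place if no row has this feature
                filled[i][j] = sum(v for _, v in donors) / len(donors)
    return filled


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples, each a random interpolation between
    a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Hypothetical toy data: 8 samples x 3 genes, with two missing values.
nan = math.nan
expression = [
    [1.0, 2.0, nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [5.0, 6.0, 9.0],
    [5.2, 6.1, 9.1],
    [4.9, 5.8, 8.8],
    [5.1, nan, 9.2],
    [5.3, 6.2, 8.9],
]
labels = ["brain"] * 3 + ["bone"] * 5  # hypothetical metastasis sites

imputed = knn_impute(expression, k=2)
minority = [row for row, lab in zip(imputed, labels) if lab == "brain"]
synthetic = smote_like(minority, n_new=2, k=2)  # balances the classes 5 vs 5
```

In practice a library implementation would normally be preferred over hand-written routines; the sketch only makes the two ideas concrete. Imputation runs before oversampling here so that the synthetic samples are interpolated from complete rows.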
Pages: 36