Identifying Diagnostic Biomarkers of Breast Cancer Based on Gene Expression Data and Ensemble Feature Selection

Cited by: 6
Authors
Li, Lingyu [1 ]
Algabri, Yousif A. [1 ]
Liu, Zhi-Ping [1 ]
Affiliations
[1] Shandong Univ, Sch Control Sci & Engn, Dept Biomed Engn, Jinan 250061, Shandong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Biomarker; machine learning; ensemble feature selection; gene expression data; breast cancer; early detection; IDENTIFICATION; DISCOVERY; VARIABLES;
DOI
10.2174/1574893618666230111153243
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Background: In recent years, the identification of biomarkers or signatures based on gene expression profiling data has attracted much attention in bioinformatics. The successful discovery of breast cancer (BRCA) biomarkers would support early detection and thereby help reduce the risk of BRCA among patients.
Methods: This paper proposes an ensemble feature selection method, abbreviated as EFSmarker, to screen biomarkers for BRCA from publicly available gene expression data. First, we employ twelve filter feature selection methods, namely median, variance, Chi-square, Relief, Pearson and Spearman correlation, mutual information, minimal-redundancy-maximal-relevance criterion, ridge regression, decision tree, and random forest with Gini index and accuracy index, to calculate the importance (weights or coefficients) of all features on the training dataset. Second, we apply the logistic regression classifier on the test dataset to calculate the classification AUC value of each feature subset individually selected by the twelve methods. Third, we build an ensemble feature selection method by aggregating feature importance with the classification AUC value. In particular, we establish a feature importance score (FIS) to evaluate the importance of each feature across all feature selection methods. Finally, the features with higher FIS are taken as the identified biomarkers.
Results: Guided by the FIS index produced by the EFSmarker method, 12 genes (COL10A1, COL11A1, MMP11, LOC728264, FIGF, GJB2, INHBA, CD300LG, IGFBP6, PAMR1, CXCL2 and FXYD1) are regarded as diagnostic biomarkers for BRCA. In particular, COL10A1, ranked first with an FIS value of 0.663, is identified as the most credible biomarker. Gene and protein expression validation, functional enrichment analysis, literature checking and independent dataset validation support the effectiveness and efficiency of these selected biomarkers.
Conclusion: Our proposed biomarker discovery strategy uses the feature contribution and the prediction accuracy simultaneously, and may also serve as a model for identifying unknown biomarkers for other diseases from high-throughput gene expression data. The source code and data are available at .
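The aggregation step described in the Methods can be sketched as follows. The abstract does not give the FIS formula, so the weighting used here is an assumption for illustration only: each method's absolute importances are min-max normalised to [0, 1] and then averaged with weights proportional to that method's AUC. The function name `feature_importance_score` and the toy numbers are hypothetical and are not taken from the paper or its source code.

```python
import numpy as np


def feature_importance_score(importances, aucs):
    """Aggregate per-method feature importances into one score per feature.

    importances: shape (n_methods, n_features), raw weights or coefficients.
    aucs: shape (n_methods,), classification AUC of each method's subset.

    Assumed scheme (not the paper's formula): min-max normalise the
    absolute importances within each method, then take the AUC-weighted mean.
    """
    imp = np.abs(np.asarray(importances, dtype=float))
    aucs = np.asarray(aucs, dtype=float)

    lo = imp.min(axis=1, keepdims=True)
    span = imp.max(axis=1, keepdims=True) - lo
    span[span == 0] = 1.0  # a constant row then contributes 0 everywhere
    normalised = (imp - lo) / span

    weights = aucs / aucs.sum()  # weights sum to 1
    return weights @ normalised  # shape (n_features,)


# Toy input: 3 hypothetical methods scoring 4 hypothetical genes.
importances = np.array([
    [0.9, 0.1, 0.5, 0.1],
    [10.0, 2.0, 6.0, 2.0],
    [0.2, 0.8, 0.5, 0.2],
])
aucs = np.array([0.9, 0.8, 0.6])

fis = feature_importance_score(importances, aucs)
ranking = np.argsort(-fis)  # feature indices, highest score first
```

Under this assumed scheme a feature scores close to 1 only if the methods with high AUC rank it near the top, which mirrors the stated idea of combining feature contribution with prediction accuracy.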
Pages: 232-246
Page count: 15