Machine Learning-Based Identification of Colon Cancer Candidate Diagnostics Genes

Cited by: 33
Authors
Koppad, Saraswati [1]
Basava, Annappa [1]
Nash, Katrina [2]
Gkoutos, Georgios V. [3, 4, 5, 6, 7, 8]
Acharjee, Animesh [3, 4, 5]
Affiliations
[1] Natl Inst Technol Karnataka, Dept Comp Sci & Engn, Mangalore 575025, India
[2] Univ Birmingham, Coll Med & Dent Sci, Birmingham B15 2TT, W Midlands, England
[3] Univ Birmingham, Inst Canc & Genom Sci, Birmingham B15 2TT, W Midlands, England
[4] Univ Birmingham, Inst Translat Med, Birmingham B15 2TT, W Midlands, England
[5] Univ Hosp Birmingham, NIHR Surg Reconstruct & Microbiol Res Ctr, Birmingham B15 2WB, W Midlands, England
[6] MRC Hlth Data Res UK HDR UK, Midlands Site, Birmingham B15 2TT, W Midlands, England
[7] NIHR Expt Canc Med Ctr, Birmingham B15 2TT, W Midlands, England
[8] Univ Hosp Birmingham, NIHR Biomed Res Ctr, Birmingham B15 2TT, W Midlands, England
Source
BIOLOGY-BASEL | 2022 / Vol. 11 / Issue 03
Funding
UK Research and Innovation (UKRI);
Keywords
biomarker identification; transcriptomics; machine learning; prediction; variable selection; COLORECTAL-CANCER; LOGISTIC-REGRESSION; EXPRESSION; BIOMARKER; DISEASE; MODELS; PERFORMANCE; PROTEOMICS; DISCOVERY; CLAUDIN-1;
DOI
10.3390/biology11030365
Chinese Library Classification
Q [Biological Sciences];
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
Simple Summary: We developed a predictive approach using different machine learning methods to identify a number of genes that can potentially serve as novel diagnostic biomarkers for colon cancer.
Background: Colorectal cancer (CRC) is the third leading cause of cancer-related death and the fourth most commonly diagnosed cancer worldwide. Due to a lack of diagnostic biomarkers and limited understanding of the underlying molecular mechanisms, CRC's mortality rate continues to grow. CRC occurrence and progression are dynamic processes. The expression levels of specific molecules vary across the stages of CRC, which makes early detection and diagnosis challenging and makes the need for accurate and meaningful CRC biomarkers more pressing. Advances in high-throughput sequencing technologies have been used to explore novel gene expression, targeted treatments, and colon cancer pathogenesis. Such approaches are routinely applied and produce large datasets whose analysis increasingly depends on machine learning (ML) algorithms, which have been demonstrated to be computationally efficient platforms for identifying variables across such high-dimensional datasets.
Methods: We developed a novel ML-based experimental design to study CRC gene associations. Six different machine learning methods were employed as classifiers to identify genes that can be used as diagnostics for CRC using gene expression and clinical datasets. The accuracy, sensitivity, specificity, F1 score, and area under the receiver operating characteristic curve (AUROC) were derived to explore the differentially expressed genes (DEGs) for CRC diagnosis. Gene ontology enrichment analyses of these DEGs were performed, and predicted gene signatures were linked with miRNAs.
Results: We evaluated six machine learning classification methods (AdaBoost, ExtraTrees, logistic regression, naive Bayes classifier, random forest, and XGBoost) across different combinations of training and test datasets drawn from GEO. The accuracy and the AUROC of each combination of training and test data with each algorithm were used as comparison metrics. Random forest (RF) models consistently performed better than the other models. In total, 34 genes were identified and used for pathway and gene set enrichment analysis. Further mapping of the 34 genes to miRNAs identified miRNA hub genes.
Conclusions: We identified 34 genes with high accuracy that can be used as a diagnostic panel for CRC.
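The evaluation metrics named in the abstract (accuracy, sensitivity, specificity, F1 score, and AUROC) can be illustrated with a minimal sketch. This is not the authors' code: the function name `binary_metrics`, the 0.5 decision threshold, and the toy labels and scores are assumptions made purely for illustration, and AUROC is computed here through its rank-based (Mann-Whitney) interpretation rather than by any particular library.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Return accuracy, sensitivity, specificity, F1 and AUROC for binary labels.

    y_true  : list of 0/1 labels (1 = tumour, 0 = normal in this toy setting)
    y_score : list of predicted probabilities for class 1
    threshold : hypothetical cut-off used to turn scores into class calls
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]

    # Confusion-matrix counts
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def safe_div(num, den):
        # Avoid ZeroDivisionError when a class is absent
        return num / den if den else 0.0

    # AUROC as the probability that a random positive outranks a random
    # negative, counting ties as half a win.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)

    return {
        "accuracy": safe_div(tp + tn, len(y_true)),
        "sensitivity": safe_div(tp, tp + fn),   # true positive rate
        "specificity": safe_div(tn, tn + fp),   # true negative rate
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
        "auroc": safe_div(wins, len(pos) * len(neg)),
    }


if __name__ == "__main__":
    # Toy data only; not taken from the paper or from GEO.
    labels = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
    for name, value in binary_metrics(labels, scores).items():
        print(f"{name}: {value:.3f}")
```

In a comparison like the one the abstract describes, a function of this kind would be applied to each classifier's predictions on each training/test combination, so that the models can be ranked on the same metrics.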
Pages: 15