An Enhanced Random Forests Approach to Predict Heart Failure From Small Imbalanced Gene Expression Data

Cited by: 11
Authors
Chicco, Davide [1]
Oneto, Luca [2]
Affiliations
[1] Krembil Res Inst, Toronto, ON M5T 0S8, Canada
[2] Univ Genoa, I-16126 Genoa, Italy
Keywords
Heart attack; Gene expression; Myocardium; Machine learning; Random forests; Data preprocessing; Congestive heart failure; Heart failure; gene ranking; STEMI; infarction; feature selection; genetics; feature elimination; IDENTIFICATION; ROBUST
DOI
10.1109/TCBB.2020.3041527
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Myocardial infarctions and heart failure cause more than 17 million deaths annually worldwide. ST-segment elevation myocardial infarctions (STEMI) require timely treatment, because delays of even minutes have serious clinical consequences. Machine learning can provide alternative ways to predict heart failure and to identify the genes involved in it. For these purposes, we applied a Random Forests classifier enhanced with feature elimination to microarray gene expression data of 111 patients diagnosed with STEMI, and measured the classification performance through standard metrics such as the Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic curve (ROC AUC). Afterwards, we used the same approach to rank all genes by importance and to detect the genes most strongly associated with heart failure. We validated this ranking through a literature review and gene set enrichment analysis. Our classifier achieved MCC = +0.87 and ROC AUC = 0.918 in predicting heart failure, and our analysis identified KLHL22, WDR11, OR4Q3, GPATCH3, and FAH as the top five protein-coding genes related to heart failure. Our results confirm the effectiveness of machine learning feature elimination in predicting heart failure from gene expression, and the top genes found by our approach may help biologists and cardiologists further our understanding of heart failure.
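The two metrics named in the abstract can be illustrated with a minimal sketch. The code below computes the MCC from confusion-matrix counts and the ROC AUC as the fraction of positive/negative pairs that the scores rank correctly. The function names, the 0.5 decision threshold, and the toy labels and scores are hypothetical illustrations; they are not the study's data or the authors' implementation.

```python
from math import sqrt


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels coded as 0 (negative) and 1 (positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical, imbalanced toy data (3 positives, 5 negatives); not from the paper.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

mcc = matthews_corrcoef(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"MCC = {mcc:+.3f}, ROC AUC = {auc:.3f}")
```

MCC ranges from -1 to +1 and uses all four confusion-matrix cells, which is why it is often preferred on imbalanced data such as the small cohort described here. ROC AUC ranges from 0 to 1 and does not depend on a decision threshold.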
Pages: 2759-2765
Number of pages: 7