Gene Expression Value Prediction Based on XGBoost Algorithm

Cited by: 197
Authors
Li, Wei [1]
Yin, Yanbin [2]
Quan, Xiongwen [1]
Zhang, Han [1,3]
Affiliations
[1] Nankai Univ, Coll Artificial Intelligence, Tianjin, Peoples R China
[2] Univ Nebraska, Dept Food Sci & Technol, Lincoln, NE 68583 USA
[3] Nankai Univ, Key Lab Med Data Anal & Stat Res Tianjin, Tianjin, Peoples R China
Funding
National Social Science Fund of China;
Keywords
gene expression value; landmark gene; target gene; regression method; XGBoost; absolute error; NETWORKS;
DOI
10.3389/fgene.2019.01077
Chinese Library Classification number
Q3 [Genetics];
Discipline classification code
071007; 090102;
Abstract
Gene expression profiling has been widely used to characterize cell status to reflect the health of the body, to diagnose genetic diseases, etc. In recent years, although the cost of genome-wide expression profiling has been gradually decreasing, the cost of collecting expression profiles for thousands of genes is still very high. Considering that gene expression levels are usually highly correlated in humans, the expression values of the remaining target genes can be predicted by analyzing the values of 943 landmark genes. Hence, we designed an algorithm for predicting gene expression values based on XGBoost, which integrates multiple tree models and has stronger interpretability. We tested the performance of the XGBoost model on the GEO and RNA-seq datasets and compared the results with those of other existing models. Experiments showed that the XGBoost model achieved a significantly lower overall error than the existing D-GEX algorithm, linear regression, and KNN methods. In conclusion, the XGBoost algorithm outperforms existing models and will be a significant contribution to the toolbox for gene expression value prediction.
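The idea in the abstract, predicting a target gene's expression value from landmark gene values with an ensemble of trees and scoring it by absolute error, can be illustrated with a small sketch. This is not the authors' implementation and not XGBoost itself: it is plain gradient boosting with depth-1 regression trees (stumps) under squared loss, written in pure Python. The data are synthetic, 5 features stand in for the 943 landmark genes, and every name and hyperparameter here (`fit_boosting`, `n_rounds`, `learning_rate`, and so on) is an assumption made for illustration.

```python
import random

def mean_absolute_error(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def fit_stump(X, residuals):
    """Pick the one-feature threshold split that minimizes squared error."""
    n, d = len(X), len(X[0])
    total = sum(residuals)
    best = None  # (score, feature, threshold, left_value, right_value)
    for j in range(d):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # cannot split between identical values
            left_mean = left_sum / k
            right_mean = (total - left_sum) / (n - k)
            # Maximizing this score is equivalent to minimizing the SSE.
            score = k * left_mean ** 2 + (n - k) * right_mean ** 2
            if best is None or score > best[0]:
                best = (score, j, (lo + hi) / 2, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    feature, threshold, left_value, right_value = stump
    return left_value if x[feature] <= threshold else right_value

def fit_boosting(X, y, n_rounds=100, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + learning_rate * predict_stump(stump, x)
                for p, x in zip(pred, X)]
    return base, stumps, learning_rate

def predict_boosting(model, X):
    base, stumps, lr = model
    return [base + lr * sum(predict_stump(s, x) for s in stumps) for x in X]

# Synthetic stand-in data: 5 "landmark genes" and one "target gene".
rng = random.Random(0)

def make_samples(n):
    X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n)]
    y = [1.5 * x[0] + (1.0 if x[1] > 0 else -1.0) + 0.1 * rng.gauss(0, 1)
         for x in X]
    return X, y

X_train, y_train = make_samples(200)
X_test, y_test = make_samples(100)

model = fit_boosting(X_train, y_train)
predictions = predict_boosting(model, X_test)

# Baseline: always predict the mean target value seen in training.
train_mean = sum(y_train) / len(y_train)
baseline_mae = mean_absolute_error(y_test, [train_mean] * len(y_test))
boosted_mae = mean_absolute_error(y_test, predictions)
print("baseline MAE:", round(baseline_mae, 3))
print("boosted MAE:", round(boosted_mae, 3))
```

In a setting like the one the abstract describes, one such model would presumably be trained per target gene, and a library implementation such as XGBoost would add regularization and second-order gradient information that this sketch leaves out.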
Pages: 7