The gradient boosting algorithm and random boosting for genome-assisted evaluation in large data sets

Cited by: 38
Authors
Gonzalez-Recio, O. [1]
Jimenez-Montero, J. A. [2]
Alenda, R. [2]
Affiliations
[1] Inst Nacl Invest & Tecnol Agr & Alimentaria INIA, Dept Mejora Genet Anim, Madrid 28040, Spain
[2] Univ Politecn Madrid, Escuela Tecn Super Ingn ETSI Agronomos, Dept Prod Anim, E-28040 Madrid, Spain
Keywords
genomic evaluation; boosting; machine learning; predictive ability; PREDICTING QUANTITATIVE TRAITS; GENETIC EVALUATION; INFORMATION; PEDIGREE; MACHINE; MARKERS; MODELS
DOI
10.3168/jds.2012-5630
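Chinese Library Classification number
S8 [Animal husbandry, veterinary medicine, hunting, sericulture, apiculture]
Discipline classification code
0905
Abstract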
In the next few years, with the advent of high-density single nucleotide polymorphism (SNP) arrays and genome sequencing, genomic evaluation methods will need to deal with a large number of genetic variants and an increasing sample size. The boosting algorithm is a machine-learning technique that may alleviate the drawbacks of dealing with such large data sets. This algorithm combines different predictors sequentially, applying some shrinkage to each; each predictor is fitted to the residuals from the committee formed by the previous ones, and the final prediction is based on a subset of covariates. Here, a detailed description is provided and examples using a toy data set are included. A modification of the algorithm called "random boosting" is proposed to increase predictive ability and decrease the computation time of genome-assisted evaluation in large data sets. Random boosting uses a random selection of markers to add each subsequent weak learner to the predictive model. These modifications were applied to a real data set composed of 1,797 bulls genotyped for 39,714 SNP. Deregressed proofs of 4 yield traits and 1 type trait from January 2009 routine evaluations were used as dependent variables. A 2-fold cross-validation scenario was implemented. Sires born before 2005 were used as a training sample (1,576 and 1,562 for production and type traits, respectively), whereas younger sires were used as a testing sample to evaluate the predictive ability of the algorithm on yet-to-be-observed phenotypes. A comparison with the original algorithm was provided. The predictive ability of the algorithm was measured as the Pearson correlation between observed and predicted responses. Further, estimated bias was computed as the average difference between observed and predicted phenotypes.
The results showed that the modified boosting algorithm could be run in 1% of the time used by the original algorithm, with negligible differences in accuracy and bias. This modification may be used to speed up the computation of genome-assisted evaluation in large data sets such as those obtained from consortiums.
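The procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weak learner (a least-squares regression on one centered marker), the function name `random_boosting`, the parameters `shrinkage` and `frac_markers`, and the simulated genotypes are all assumptions made for illustration. Each iteration draws a random subset of markers, fits the best single-marker learner to the current residuals, and adds a shrunken version of it to the model. Predictive ability and bias are then computed as the abstract defines them.

```python
import numpy as np

def random_boosting(X, y, n_iter=300, shrinkage=0.1, frac_markers=0.2, seed=0):
    """L2 boosting with single-marker least-squares weak learners (an assumption).

    frac_markers=1.0 searches every marker at each step (plain boosting);
    smaller values search only a random subset (the random-boosting idea).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    col_means = X.mean(axis=0)
    Xc = X - col_means                       # centered marker codes
    ss = (Xc ** 2).sum(axis=0)               # per-marker sum of squares
    usable = np.flatnonzero(ss > 0)          # skip monomorphic markers
    if usable.size == 0:
        raise ValueError("no polymorphic markers to boost on")
    k = max(1, int(round(frac_markers * usable.size)))
    coef = np.zeros(p)
    resid = y - y.mean()                     # residuals of the initial (mean) model
    for _ in range(n_iter):
        cand = rng.choice(usable, size=k, replace=False)   # random marker subset
        b = Xc[:, cand].T @ resid / ss[cand]               # single-marker OLS slopes
        best = np.argmax(b ** 2 * ss[cand])                # largest drop in squared error
        m = cand[best]
        coef[m] += shrinkage * b[best]                     # add the shrunken learner
        resid -= shrinkage * b[best] * Xc[:, m]            # update residuals
    intercept = y.mean() - col_means @ coef
    return intercept, coef

# Toy data (simulated, hypothetical): 300 animals, 500 markers coded 0/1/2,
# of which the first 10 have an effect on the phenotype.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(300, 500)).astype(float)
beta = np.zeros(500)
beta[:10] = 1.0
y = X @ beta + rng.normal(size=300)

# Train on the first 200 records, test on the remaining 100.
X_tr, y_tr, X_te, y_te = X[:200], y[:200], X[200:], y[200:]
intercept, coef = random_boosting(X_tr, y_tr)
pred = intercept + X_te @ coef

r = np.corrcoef(y_te, pred)[0, 1]    # predictive ability: Pearson correlation
bias = np.mean(y_te - pred)          # bias: mean of observed minus predicted
print(f"markers used: {np.count_nonzero(coef)}, r = {r:.3f}, bias = {bias:.3f}")
```

In this sketch, the per-iteration cost scales with the number of candidate markers, so a smaller `frac_markers` reduces computation roughly in proportion; the size of any speed-up on real data depends on the implementation and is not something this toy example demonstrates.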
Pages: 614-624
Page count: 11