Using machine learning to realize genetic site screening and genomic prediction of productive traits in pigs

Cited by: 8

Authors:

Xiang, Tao [1,2]

Li, Tao [3,4,5]

Li, Jielin [1,2]

Li, Xin [3,4,5]

Wang, Jia [3,4,5]

Affiliations:

[1] Huazhong Agr Univ, Key Lab Agr Anim Genet Breeding & Reprod, Minist Educ, Wuhan, Peoples R China

[2] Huazhong Agr Univ, Key Lab Swine Genet & Breeding, Minist Agr, Wuhan, Peoples R China

[3] Huazhong Agr Univ, Coll Informat, 1 Shizishan St, Wuhan 430070, Peoples R China

[4] Huazhong Agr Univ, Key Lab Smart Farming Agr Anim, Wuhan, Peoples R China

[5] Huazhong Agr Univ, Hubei Key Lab Agr Bioinformat, Wuhan, Peoples R China

Source:

FASEB JOURNAL | 2023 / Vol. 37 / Issue 06

Keywords:

deep learning; feature selection; genomic prediction; machine learning; pigs; SELECTION; XGBOOST

DOI:

10.1096/fj.202300245R

Chinese Library Classification (CLC) codes:

Q5 [Biochemistry]; Q7 [Molecular Biology]

Discipline classification codes:

071010; 081704

Abstract:

Genomic prediction, which is based on solving linear mixed-model (LMM) equations, is the most popular method for predicting breeding values or phenotypic performance for economic traits in livestock. With the need to further improve the performance of genomic prediction, nonlinear methods have been considered as an alternative and promising approach. The excellent ability to predict phenotypes in animal husbandry has been demonstrated by machine learning (ML) approaches, which have been rapidly developed. To investigate the feasibility and reliability of implementing genomic prediction using nonlinear models, the performances of genomic predictions for pig productive traits using the linear genomic selection model and nonlinear machine learning models were compared. Then, to reduce the high-dimensional features of genome sequence data, different machine learning algorithms, including the random forest (RF), support vector machine (SVM), extreme gradient boosting (XGBoost) and convolutional neural network (CNN) algorithms, were used to perform genomic feature selection as well as genomic prediction on reduced feature genome data. All of the analyses were processed on two real pig datasets: the published PIC pig dataset and a dataset comprising data from a national pig nucleus herd in Chifeng, North China. Overall, the accuracies of predicted phenotypic performance for traits T1, T2, T3 and T5 in the PIC dataset and average daily gain (ADG) in the Chifeng dataset were higher using the ML methods than the LMM method, while those for trait T4 in the PIC dataset and total number of piglets born (TNB) in the Chifeng dataset were slightly lower using the ML methods than the LMM method. Among all the different ML algorithms, SVM was the most appropriate for genomic prediction. For the genomic feature selection experiment, the most stable and most accurate results across different algorithms were achieved using XGBoost in combination with the SVM algorithm. Through feature selection, the number of genomic markers can be reduced to 1 in 20, while the predictive performance on some traits can even be improved compared to using the full genome data. Finally, we developed a new tool that can be used to execute combined XGBoost and SVM algorithms to realize genomic feature selection and phenotypic prediction.
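
The "reduce the markers to 1 in 20" step described in the abstract can be illustrated with a minimal sketch. This is not the authors' tool: the paper ranks markers with XGBoost and predicts with SVM, whereas the sketch below swaps in a plain absolute Pearson correlation as a stand-in importance score so that it runs with the Python standard library alone. The simulated genotypes, the function names (`simulate`, `select_top_markers`, `reduce_genotypes`) and the `keep_fraction` parameter are all hypothetical and chosen only for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def simulate(n_animals, n_markers, causal, seed=0):
    """Simulated data (an assumption, not the PIC or Chifeng data):
    genotypes coded 0/1/2, phenotype = sum of causal markers + Gaussian noise."""
    rng = random.Random(seed)
    genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_markers)]
                 for _ in range(n_animals)]
    phenotypes = [sum(row[j] for j in causal) + rng.gauss(0, 0.5)
                  for row in genotypes]
    return genotypes, phenotypes


def select_top_markers(genotypes, phenotypes, keep_fraction=1 / 20):
    """Rank markers by a stand-in importance score (|Pearson r| with the
    phenotype) and return the sorted indices of the top `keep_fraction`."""
    n_markers = len(genotypes[0])
    scores = [abs(pearson([row[j] for row in genotypes], phenotypes))
              for j in range(n_markers)]
    n_keep = max(1, round(n_markers * keep_fraction))
    ranked = sorted(range(n_markers), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])


def reduce_genotypes(genotypes, kept):
    """Keep only the selected marker columns."""
    return [[row[j] for j in kept] for row in genotypes]


# Usage: 100 simulated markers reduced to 1 in 20.
genotypes, phenotypes = simulate(n_animals=200, n_markers=100, causal=[3, 42, 77])
kept = select_top_markers(genotypes, phenotypes)
reduced = reduce_genotypes(genotypes, kept)
print(len(kept), "markers kept:", kept)
```

In a workflow closer to the one the abstract describes, the correlation score would be replaced by importances from a fitted XGBoost model, and the reduced genotype matrix would then be passed to an SVM regressor for phenotypic prediction.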


Pages: 14
