Leveraging machine learning to advance genome-wide association studies

Cited by: 0

Authors:

Dagasso, Gabrielle [1]

Yan, Yan [2]

Wang, Lipu [3]

Li, Longhai [4]

Kutcher, Randy [3]

Zhang, Wentao [5]

Jin, Lingling [6]

Affiliations:

[1] Thompson Rivers Univ, Dept Math & Stat, Kamloops, BC, Canada

[2] Thompson Rivers Univ, Dept Comp Sci, Kamloops, BC, Canada

[3] Univ Saskatchewan, Dept Plant Sci, Saskatoon, SK, Canada

[4] Univ Saskatchewan, Dept Math & Stat, Saskatoon, SK, Canada

[5] Natl Res Council Canada, Saskatoon, SK, Canada

[6] Univ Saskatchewan, Dept Comp Sci, Saskatoon, SK, Canada

Source:

INTERNATIONAL JOURNAL OF DATA MINING AND BIOINFORMATICS | 2021 / Vol. 25 / Issue 1-2

Keywords:

genome-wide association studies; machine learning; population structure analysis; cross-validation; LASSO; fusarium head blight; SOFTWARE; MODELS;

DOI:

10.1504/IJDMB.2021.116881

Chinese Library Classification (CLC) number:

Q [Biological Sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

Genome-wide association studies (GWAS) have demonstrated their power in discovering genetic variations associated with particular traits related to agronomically important features in crops. The typical output of a GWAS program includes a series of single nucleotide polymorphisms (SNPs) and their significance. Currently, there is no standard way to compare results across different programs or to select the most 'significant' results uniformly and consistently. To obtain a comprehensive and accurate set of SNPs associated with a trait of interest, we present a novel automated pipeline that leverages machine learning for GWAS discoveries. The pipeline first performs population structure analysis, then runs multiple GWAS programs and combines their results into a single SNP set. Next, it uses Least Absolute Shrinkage and Selection Operator (LASSO) analysis to select SNPs from the set with high individual and/or joint effects. Finally, the predictive ability of the model is assessed using cross-validation.
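The last three steps of the pipeline described in the abstract (combining results, LASSO selection, cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "combining" means taking the union of SNPs passing a p-value threshold in any tool, that the trait is quantitative, and that genotypes are coded 0/1/2. The tool names, SNP ids, p-values, threshold and penalty `lam` are all hypothetical, and the LASSO is a plain cyclic coordinate descent written with the standard library only.

```python
import random

def union_of_hits(results_by_tool, p_threshold):
    """Pool SNP ids whose p-value is below the threshold in at least one tool."""
    return sorted({snp for hits in results_by_tool.values()
                   for snp, p in hits.items() if p < p_threshold})

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/2n)*||y - b0 - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]  # centred genotypes
    residual = [yi - y_mean for yi in y]                          # residual at b = 0
    col_var = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            if col_var[j] == 0.0:
                continue  # a monomorphic SNP carries no information
            # Covariance of SNP j with the partial residual that excludes SNP j.
            rho = sum(Xc[i][j] * residual[i] for i in range(n)) / n + col_var[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_var[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                beta[j] = new_beta
    intercept = y_mean - sum(b * m for b, m in zip(beta, col_mean))
    return intercept, beta

def predict(intercept, beta, X):
    return [intercept + sum(b * x for b, x in zip(beta, row)) for row in X]

def cross_validated_mse(X, y, lam, k=5, seed=0):
    """Mean squared prediction error over k held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    errors = []
    for f in range(k):
        fold = idx[f::k]
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        b0, beta = lasso_coordinate_descent([X[i] for i in train],
                                            [y[i] for i in train], lam)
        preds = predict(b0, beta, [X[i] for i in fold])
        errors.extend((pr - y[i]) ** 2 for pr, i in zip(preds, fold))
    return sum(errors) / len(errors)

# Hypothetical toy data: 60 samples, 6 SNPs; only snp0 and snp3 affect the trait.
rng = random.Random(42)
all_snps = ["snp%d" % j for j in range(6)]
geno = [{s: rng.choice([0, 1, 2]) for s in all_snps} for _ in range(60)]
trait = [2.0 * g["snp0"] - 1.5 * g["snp3"] + rng.gauss(0, 0.1) for g in geno]

# Made-up p-values standing in for the output of two GWAS programs.
tool_results = {
    "tool_a": {"snp0": 1e-8, "snp1": 0.01, "snp2": 0.6, "snp3": 0.04},
    "tool_b": {"snp0": 1e-6, "snp3": 1e-7, "snp5": 0.03},
}
candidates = union_of_hits(tool_results, p_threshold=0.05)
X = [[g[s] for s in candidates] for g in geno]
intercept, beta = lasso_coordinate_descent(X, trait, lam=0.1)
selected = [s for s, b in zip(candidates, beta) if b != 0.0]
print("selected SNPs:", selected)
print("5-fold CV MSE:", cross_validated_mse(X, trait, lam=0.1))
```

In this sketch the L1 penalty sets the coefficients of uninformative candidate SNPs to exactly zero, so selection falls out of the fit itself, and the cross-validated error gives an out-of-sample check on the retained model. In practice `lam` would itself be tuned, for example by comparing the cross-validated error over a grid of values.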


Pages: 17-36

Number of pages: 20

References (25 in total):

[1] Controlling the false discovery rate - a practical and powerful approach to multiple testing. Benjamini, Y; Hochberg, Y. [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 1995, 57 (01): 289-300
[2] TASSEL: software for association mapping of complex traits in diverse samples. Bradbury, Peter J.; Zhang, Zhiwu; Kroon, Dallas E.; Casstevens, Terry M.; Ramdoss, Yogesh; Buckler, Edward S. [J]. BIOINFORMATICS, 2007, 23 (19): 2633-2635
[3] Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Evanno, G; Regnaut, S; Goudet, J. [J]. MOLECULAR ECOLOGY, 2005, 14 (08): 2611-2620
[4] Francis R.M, 2019, TABULATE ANALYSE VIS
[5] Regularization Paths for Generalized Linear Models via Coordinate Descent. Friedman, Jerome; Hastie, Trevor; Tibshirani, Rob. [J]. JOURNAL OF STATISTICAL SOFTWARE, 2010, 33 (01): 1-22
[6] Genomic selection. Goddard, M. E.; Hayes, B. J. [J]. JOURNAL OF ANIMAL BREEDING AND GENETICS, 2007, 124 (06): 323-330
[7] Hilton AJ, 1999, PLANT PATHOL, V48, P202, DOI 10.1046/j.1365-3059.1999.00339.x
[8] GAPIT: genome association and prediction integrated tool. Lipka, Alexander E.; Tian, Feng; Wang, Qishan; Peiffer, Jason; Li, Meng; Bradbury, Peter J.; Gore, Michael A.; Buckler, Edward S.; Zhang, Zhiwu. [J]. BIOINFORMATICS, 2012, 28 (18): 2397-2399
[9] Lippert C, 2011, NAT METHODS, V8, P833, DOI 10.1038/nmeth.1681
[10] Efficient Bayesian mixed-model analysis increases association power in large cohorts. Loh, Po-Ru; Tucker, George; Bulik-Sullivan, Brendan K.; Vilhjalmsson, Bjarni J.; Finucane, Hilary K.; Salem, Rany M.; Chasman, Daniel I.; Ridker, Paul M.; Neale, Benjamin M.; Berger, Bonnie; Patterson, Nick; Price, Alkes L. [J]. NATURE GENETICS, 2015, 47 (03): 284-+
