Leveraging machine learning to advance genome-wide association studies

Cited by: 0
Authors
Dagasso, Gabrielle [1]
Yan, Yan [2]
Wang, Lipu [3]
Li, Longhai [4]
Kutcher, Randy [3]
Zhang, Wentao [5]
Jin, Lingling [6]
Affiliations
[1] Thompson Rivers Univ, Dept Math & Stat, Kamloops, BC, Canada
[2] Thompson Rivers Univ, Dept Comp Sci, Kamloops, BC, Canada
[3] Univ Saskatchewan, Dept Plant Sci, Saskatoon, SK, Canada
[4] Univ Saskatchewan, Dept Math & Stat, Saskatoon, SK, Canada
[5] Natl Res Council Canada, Saskatoon, SK, Canada
[6] Univ Saskatchewan, Dept Comp Sci, Saskatoon, SK, Canada
Keywords
genome-wide association studies; machine learning; population structure analysis; cross-validation; LASSO; fusarium head blight; SOFTWARE; MODELS
DOI
10.1504/IJDMB.2021.116881
Chinese Library Classification number
Q [Biological sciences]
Discipline classification code
07; 0710; 09
Abstract
Genome-wide association studies (GWAS) have demonstrated their power in discovering genetic variations linked to particular traits related to agronomically important features in crops. The typical output of a GWAS program is a series of single nucleotide polymorphisms (SNPs) together with their significance. Currently, there is no standard way to compare results across different programs or to select the most 'significant' results uniformly and consistently. To obtain a comprehensive and accurate set of SNPs associated with a trait of interest, we present a novel automated pipeline that leverages machine learning for GWAS discoveries. The pipeline first performs population structure analysis, then runs multiple GWAS programs and combines their results into a single SNP set. Next, it applies Least Absolute Shrinkage and Selection Operator (LASSO) analysis to select the SNPs in that set with high individual and/or joint effects. Finally, the predictive ability of the model is assessed using cross-validation.
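The last three steps described in the abstract (combining SNP sets, LASSO selection, cross-validation) can be illustrated with a small sketch. This is not the authors' pipeline: the function names (`union_snps`, `lasso_cd`, `kfold_mse`), the SNP identifiers, the simulated genotypes and the penalty value `lam=0.1` are all assumptions made for illustration, and the LASSO is fitted here with a plain cyclic coordinate descent written in standard-library Python. Population structure analysis and the GWAS runs themselves are not shown.

```python
import random


def union_snps(*result_sets):
    """Combine SNP IDs reported by several GWAS tools into one sorted list."""
    combined = set()
    for ids in result_sets:
        combined |= set(ids)
    return sorted(combined)


def soft_threshold(z, gamma):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - b0 - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    b0 = sum(y) / n
    resid = [yi - b0 for yi in y]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        # The intercept is not penalised: absorb the mean residual into it.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # column is all zeros and carries no information
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return b0, beta


def kfold_mse(X, y, lam, k=5, seed=0):
    """Estimate out-of-sample mean squared error with k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    sq_errors = []
    for f in range(k):
        test = idx[f::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        b0, beta = lasso_cd([X[i] for i in train], [y[i] for i in train], lam)
        for i in test:
            pred = b0 + sum(b * x for b, x in zip(beta, X[i]))
            sq_errors.append((y[i] - pred) ** 2)
    return sum(sq_errors) / len(sq_errors)


# Hypothetical outputs of three GWAS programs (made-up SNP IDs).
snp_ids = union_snps(["snp_a", "snp_b", "snp_c"], ["snp_b", "snp_d"], ["snp_e", "snp_a"])

# Simulated genotypes coded 0/1/2; only snp_a and snp_d affect the trait.
rng = random.Random(42)
X = [[rng.choice((0, 1, 2)) for _ in snp_ids] for _ in range(80)]
y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 0.1) for row in X]

b0, beta = lasso_cd(X, y, lam=0.1)
selected = [s for s, b in zip(snp_ids, beta) if b != 0.0]
cv_mse = kfold_mse(X, y, lam=0.1)
print("selected SNPs:", selected)
print("5-fold CV mean squared error:", cv_mse)
```

In practice the penalty would be chosen by cross-validating over a grid of values rather than fixed in advance, and an established LASSO implementation would normally replace the hand-written solver.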
Pages: 17-36
Page count: 20