Valid inference for machine learning-assisted genome-wide association studies

Cited by: 2
Authors
Miao, Jiacheng [1]
Wu, Yixuan [1]
Sun, Zhongxuan [1]
Miao, Xinran [2]
Lu, Tianyuan [3,4]
Zhao, Jiwei [1,2]
Lu, Qiongshi [1,2,5]
Affiliations
[1] Univ Wisconsin Madison, Dept Biostat & Med Informat, Madison, WI 53706 USA
[2] Univ Wisconsin Madison, Dept Stat, Madison, WI 53706 USA
[3] Jewish Gen Hosp, Lady Davis Inst Med Res, Montreal, PQ, Canada
[4] Univ Toronto, Dept Stat Sci, Toronto, ON, Canada
[5] Univ Wisconsin Madison, Ctr Demog Hlth & Aging, Madison, WI 53706 USA
Funding
US National Institutes of Health
Keywords
METAANALYSIS; RESOURCE; GENETICS; DENSITY
DOI
10.1038/s41588-024-01934-0
Chinese Library Classification
Q3 [Genetics]
Subject classification codes
071007; 090102
Abstract
Machine learning (ML) has become increasingly popular in almost all scientific disciplines, including human genetics. Owing to challenges related to sample collection and precise phenotyping, ML-assisted genome-wide association studies (GWAS), which use sophisticated ML techniques to impute phenotypes and then perform GWAS on the imputed outcomes, have become increasingly common in complex trait genetics research. However, the validity of ML-assisted GWAS associations has not been carefully evaluated. Here, we report pervasive risks for false-positive associations in ML-assisted GWAS and introduce Post-Prediction GWAS (POP-GWAS), a statistical framework that redesigns GWAS on ML-imputed outcomes. POP-GWAS ensures valid and powerful statistical inference irrespective of imputation quality and choice of algorithm, requiring only GWAS summary statistics as input. We employed POP-GWAS to perform a GWAS of bone mineral density derived from dual-energy X-ray absorptiometry imaging at 14 skeletal sites, identifying 89 new loci and revealing skeletal site-specific genetic architecture. Our framework offers a robust analytic solution for future ML-assisted GWAS. Post-prediction genome-wide association study (POP-GWAS) is a statistical framework that uses summary statistics from labeled samples, which have both observed and imputed phenotypes, to debias single-nucleotide polymorphism effect size estimates for unlabeled samples, which have imputed phenotypes only, leading to valid and powerful inference.
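The debiasing idea described in the abstract can be illustrated with a small sketch in the style of prediction-powered inference. It is not taken from the paper and should not be read as the authors' exact estimator: the function name `debiased_effect`, its arguments, the weight `omega`, and the correlation input `r` are all hypothetical, and the sketch assumes that the labeled and unlabeled samples are independent and that the correlation between the two labeled-sample estimates is supplied by the user.

```python
import math


def debiased_effect(beta_y_lab, se_y_lab,
                    beta_yhat_lab, se_yhat_lab,
                    beta_yhat_unlab, se_yhat_unlab,
                    r, omega=None):
    """Combine three per-SNP GWAS summary statistics (hypothetical helper).

    beta_y_lab      : effect on the observed phenotype, labeled samples
    beta_yhat_lab   : effect on the imputed phenotype, labeled samples
    beta_yhat_unlab : effect on the imputed phenotype, unlabeled samples
    r               : assumed correlation between beta_y_lab and
                      beta_yhat_lab (both come from the same individuals)
    omega           : weight on the imputed-phenotype correction; if None,
                      the variance-minimizing weight is used
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")

    # Covariance between the two labeled-sample estimates.
    cov = r * se_y_lab * se_yhat_lab
    # Variance of (beta_yhat_unlab - beta_yhat_lab), assuming the labeled
    # and unlabeled samples are independent.
    var_diff = se_yhat_lab ** 2 + se_yhat_unlab ** 2

    if omega is None:
        omega = cov / var_diff  # minimizes the variance computed below

    # The imputed-phenotype bias cancels in the difference, so the estimate
    # stays centered on the observed-phenotype effect for any omega.
    beta = beta_y_lab + omega * (beta_yhat_unlab - beta_yhat_lab)
    var = se_y_lab ** 2 + omega ** 2 * var_diff - 2.0 * omega * cov
    se = math.sqrt(var)

    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return beta, se, z, p
```

Under these assumptions, setting `omega` to 0 recovers the labeled-only GWAS, while a well-correlated imputed phenotype yields a smaller standard error than the labeled-only estimate. The result depends on the imputed phenotype only through a difference of two estimates of the same quantity, which is why imputation quality affects power but not validity in this sketch.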
Pages: 2361-2369
Page count: 15