Deep learning-based phenotype imputation on population-scale biobank data increases genetic discoveries

Cited by: 14
Authors
An, Ulzee [1]
Pazokitoroudi, Ali [1]
Alvarez, Marcus [2]
Huang, Lianyun [3, 4, 5]
Bacanu, Silviu [6, 7]
Schork, Andrew J. [8, 9, 10]
Kendler, Kenneth [6, 7]
Pajukanta, Paeivi [2, 11]
Flint, Jonathan [2]
Zaitlen, Noah [12]
Cai, Na [3, 4, 5]
Dahl, Andy [13]
Sankararaman, Sriram [1, 2, 14]
Affiliations
[1] UCLA, Comp Sci Dept, Los Angeles, CA 90095 USA
[2] UCLA, David Geffen Sch Med, Dept Human Genet, Los Angeles, CA 90095 USA
[3] Helmholtz Zentrum Munchen, Helmholtz Pioneer Campus, Neuherberg, Germany
[4] Helmholtz Zentrum Munchen, Computat Hlth Ctr, Neuherberg, Germany
[5] Tech Univ Munich, Sch Med, Munich, Germany
[6] Virginia Commonwealth Univ, Virginia Inst Psychiat & Behav Genet, Richmond, VA USA
[7] Virginia Commonwealth Univ, Dept Psychiat, Richmond, VA USA
[8] Copenhagen Univ Hosp, Inst Biol Psychiat, Mental Hlth Ctr Sct Hans, Copenhagen, Denmark
[9] Translat Genom Res Inst TGEN, Neurogenom Div, Phoenix, AZ USA
[10] Univ Copenhagen, GLOBE Inst, Fac Hlth & Med Sci, Sect Geogenet, Copenhagen, Denmark
[11] UCLA, Inst Precis Hlth, David Geffen Sch Med, Los Angeles, CA USA
[12] UCLA, Neurol Dept, Los Angeles, CA USA
[13] Univ Chicago, Sect Genet Med, Chicago, IL USA
[14] UCLA, Dept Computat Med, Los Angeles, CA 90095 USA
Funding
US National Institutes of Health; US National Science Foundation
Keywords
GENOME-WIDE ASSOCIATION; LD SCORE REGRESSION; DNA
DOI
10.1038/s41588-023-01558-w
Chinese Library Classification number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Biobanks that collect deep phenotypic and genomic data across many individuals have emerged as a key resource in human genetics. However, phenotypes in biobanks are often missing across many individuals, limiting their utility. We propose AutoComplete, a deep learning-based imputation method to impute or 'fill-in' missing phenotypes in population-scale biobank datasets. When applied to collections of phenotypes measured across ~300,000 individuals from the UK Biobank, AutoComplete substantially improved imputation accuracy over existing methods. On three traits with notable amounts of missingness, we show that AutoComplete yields imputed phenotypes that are genetically similar to the originally observed phenotypes while increasing the effective sample size by about twofold on average. Further, genome-wide association analyses on the resulting imputed phenotypes led to a substantial increase in the number of associated loci. Our results demonstrate the utility of deep learning-based phenotype imputation to increase power for genetic discoveries in existing biobank datasets.
Pages: 2269-2272
Page count: 4