Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects

Cited by: 0
Authors
James Zou
Gregory Valiant
Paul Valiant
Konrad Karczewski
Siu On Chan
Kaitlin Samocha
Monkol Lek
Shamil Sunyaev
Mark Daly
Daniel G. MacArthur
Affiliations
[1] Stanford University, Department of Biomedical Data Science
[2] Stanford University, Computer Science Department
[3] Brown University, Computer Science Department
[4] Analytic and Translational Genetics Unit, Division of Genetics
[5] Massachusetts General Hospital, Department of Medicine
[6] Broad Institute of MIT and Harvard
[7] Computer Science and Engineering
[8] Chinese University of Hong Kong
[9] Brigham and Women’s Hospital
[10] Harvard Medical School
[11] Harvard Medical School
Source
Nature Communications, Volume 7
Keywords
DOI
Not available
CLC number
Subject classification code
Abstract
As new proposals aim to sequence ever-larger collections of humans, it is critical to have a quantitative framework to evaluate the statistical power of these projects. We developed a new algorithm, UnseenEst, and applied it to the exomes of 60,706 individuals to estimate the frequency distribution of all protein-coding variants, including rare variants that have not yet been observed in current cohorts. Our results quantify the number of new variants that we expect to identify as sequencing cohorts reach hundreds of thousands of individuals. With 500K individuals, we expect to capture 7.5% of all possible loss-of-function variants and 12% of all possible missense variants. We also estimate that 2,900 genes have a loss-of-function frequency of <0.00001 in healthy humans, consistent with very strong intolerance to gene inactivation.
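To make the abstract's projection concrete, the following minimal sketch shows how an estimated frequency distribution translates into an expected discovery count at a larger cohort size. It is not UnseenEst and does not reproduce the paper's estimates. It assumes that each variant is sampled independently across 2n chromosomes, so that a variant with allele frequency f is seen at least once with probability 1 - (1 - f)^(2n). The function names and the toy histogram are hypothetical and chosen only for illustration.

```python
def prob_observed(freq, n_individuals):
    """Probability that a variant with allele frequency `freq` appears at
    least once among the 2n chromosomes of n diploid individuals
    (assumes independent sampling of chromosomes)."""
    return 1.0 - (1.0 - freq) ** (2 * n_individuals)


def expected_discovered(freq_histogram, n_individuals):
    """Expected number of distinct variants observed in a cohort, given a
    histogram mapping allele frequency -> number of variants at that
    frequency."""
    return sum(
        count * prob_observed(freq, n_individuals)
        for freq, count in freq_histogram.items()
    )


# Hypothetical toy histogram (NOT data from the paper):
# a few common variants and many very rare ones.
toy_histogram = {1e-3: 1_000, 1e-5: 50_000, 1e-7: 2_000_000}

if __name__ == "__main__":
    # Cohort sizes mentioned in the abstract.
    for n in (60_706, 500_000):
        print(n, round(expected_discovered(toy_histogram, n)))
```

In this sketch the frequency histogram is taken as given. The harder problem, which the abstract attributes to UnseenEst, is estimating that histogram, including the part made up of variants not yet observed.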