Accurate detection and genotyping of SNPs utilizing population sequencing data

Cited by: 78
Authors
Bansal, Vikas [1]
Harismendy, Olivier [1]
Tewhey, Ryan [1]
Murray, Sarah S. [1]
Schork, Nicholas J. [1]
Topol, Eric J. [1]
Frazer, Kelly A. [1]
Affiliations
[1] Scripps Res Inst, Scripps Translat Sci Inst, Scripps Genom Med, La Jolla, CA 92037 USA
Keywords
SHORT READ ALIGNMENT; HUMAN GENOME; RARE VARIANTS; CONTRIBUTE; IMPUTATION; ULTRAFAST; GENES; SETS
DOI
10.1101/gr.100040.109
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Next-generation sequencing technologies have made it possible to sequence targeted regions of the human genome in hundreds of individuals. Deep sequencing represents a powerful approach for the discovery of the complete spectrum of DNA sequence variants in functionally important genomic intervals. Current methods for single nucleotide polymorphism (SNP) detection are designed to detect SNPs from single-individual sequence data sets. Here, we describe a novel method, SNIP-Seq (single nucleotide polymorphism identification from population sequence data), that leverages sequence data from a population of individuals to detect SNPs and assign genotypes to individuals. To evaluate our method, we used sequence data from a 200-kilobase (kb) region on chromosome 9p21 of the human genome. This region was sequenced in 48 individuals (five sequenced in duplicate) using the Illumina GA platform. Using this data set, we demonstrate that our method is highly accurate for detecting variants and can filter out false SNPs that are attributable to sequencing errors. The concordance of sequencing-based genotype assignments between duplicate samples was 98.8%. The 200-kb region was independently sequenced to a high depth of coverage using two sequence pools containing the 48 individuals. Many of the novel SNPs identified by SNIP-Seq from the individual sequencing were validated by the pooled sequencing data and were subsequently confirmed by Sanger sequencing. We estimate that SNIP-Seq achieves a low false-positive rate of ~2%, improving upon the higher false-positive rate of existing methods that do not use population sequence data. Collectively, these results suggest that analysis of population sequencing data is a powerful approach for the accurate detection of SNPs and the assignment of genotypes to individual samples.
Pages: 537-545
Page count: 9