Dindel: Accurate indel calls from short-read data

Cited by: 302
Authors
Albers, Cornelis A. [1, 2, 3]
Lunter, Gerton [4]
MacArthur, Daniel G. [1]
McVean, Gilean [5]
Ouwehand, Willem H. [1, 2, 3]
Durbin, Richard [1]
Affiliations
[1] Wellcome Trust Sanger Inst, Hinxton CB10 1HH, Cambs, England
[2] Univ Cambridge, Dept Haematol, Cambridge CB2 1TN, England
[3] Natl Hlth Serv Blood & Transplant, Cambridge CB2 1TN, England
[4] Wellcome Trust Ctr Human Genet, Oxford OX3 7BN, England
[5] Univ Oxford, Dept Stat, Oxford OX1 3TG, England
Funding
Wellcome Trust (UK)
Keywords
HUMAN GENOME; SEQUENCE DATA; ALIGNMENT; ELEMENTS; PROJECT; MUSCLE; MAP;
DOI
10.1101/gr.112326.110
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Small insertions and deletions (indels) are a common and functionally important type of sequence polymorphism. Most of the focus of studies of sequence variation is on single nucleotide variants (SNVs) and large structural variants. In principle, high-throughput sequencing studies should allow identification of indels just as SNVs. However, inference of indels from next-generation sequence data is challenging, and so far methods for identifying indels lag behind methods for calling SNVs in terms of sensitivity and specificity. We propose a Bayesian method to call indels from short-read sequence data in individuals and populations by realigning reads to candidate haplotypes that represent alternative sequence to the reference. The candidate haplotypes are formed by combining candidate indels and SNVs identified by the read mapper, while allowing for known sequence variants or candidates from other methods to be included. In our probabilistic realignment model we account for base-calling errors, mapping errors, and also, importantly, for increased sequencing error indel rates in long homopolymer runs. We show that our method is sensitive and achieves low false discovery rates on simulated and real data sets, although challenges remain. The algorithm is implemented in the program Dindel, which has been used in the 1000 Genomes Project call sets.
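The idea of scoring reads against candidate haplotypes can be illustrated with a toy example. The sketch below is not Dindel's model: it uses ungapped read placement with a single assumed base-error rate, and it omits mapping errors and the homopolymer-dependent indel error rates that the abstract mentions. All names, sequences, and prior values are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations_with_replacement

def read_log_likelihood(read, haplotype, base_error=0.01):
    """Log-probability of the best ungapped placement of `read` on `haplotype`.

    Assumption: every base is miscalled independently with rate `base_error`,
    and a miscall is equally likely to be any of the three other bases.
    """
    best = -math.inf
    for start in range(len(haplotype) - len(read) + 1):
        window = haplotype[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        matches = len(read) - mismatches
        ll = (matches * math.log(1 - base_error)
              + mismatches * math.log(base_error / 3))
        best = max(best, ll)
    return best

def genotype_posteriors(reads, haplotypes, prior):
    """Posterior over unordered diploid genotypes (pairs of haplotype names)."""
    log_post = {}
    for g in combinations_with_replacement(sorted(haplotypes), 2):
        lp = math.log(prior[g])
        for read in reads:
            # Each read comes from either chromosome copy with probability 1/2.
            l1 = read_log_likelihood(read, haplotypes[g[0]])
            l2 = read_log_likelihood(read, haplotypes[g[1]])
            m = max(l1, l2)
            lp += m + math.log(0.5 * math.exp(l1 - m) + 0.5 * math.exp(l2 - m))
        log_post[g] = lp
    # Normalize in log space to avoid underflow.
    top = max(log_post.values())
    weights = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Hypothetical data: the alternative haplotype carries a 2-bp deletion ("TT").
haplotypes = {
    "ref": "ACGTACGGTTCAGCTAGCTA",
    "alt": "ACGTACGGCAGCTAGCTA",
}
reads = ["TACGGTTCAG", "CGGTTCAGCT",   # consistent with the reference
         "TACGGCAGCT", "ACGGCAGCTA"]   # span the deletion
prior = {("alt", "alt"): 0.001, ("alt", "ref"): 0.002, ("ref", "ref"): 0.997}

posteriors = genotype_posteriors(reads, haplotypes, prior)
best_genotype = max(posteriors, key=posteriors.get)
print(best_genotype, posteriors[best_genotype])
```

With two reads supporting each haplotype, the heterozygous genotype receives the highest posterior even though the prior strongly favors the homozygous reference. A realistic caller would replace the ungapped placement with a gapped probabilistic alignment and would weight each read by its mapping quality.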
Pages: 961-973
Page count: 13