Methods for the detection and assembly of novel sequence in high-throughput sequencing data

Cited by: 17
Authors
Holtgrewe, Manuel [1]
Kuchenbecker, Leon [1, 2]
Reinert, Knut [1]
Affiliations
[1] Free Univ Berlin, Dept Comp Sci, Berlin, Germany
[2] Max Planck Inst Mol Genet, D-14195 Berlin, Germany
Keywords
DE-NOVO; STRUCTURAL VARIATION; EFFICIENT; ALGORITHM; ALIGNMENT; GENOMES; FORMAT
DOI
10.1093/bioinformatics/btv051
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: Large insertions of novel sequence are an important type of structural variant. Previous studies used traditional de novo assemblers to assemble non-mapping high-throughput sequencing (HTS) or capillary reads and then tried to anchor the resulting contigs in the reference using paired-read information.
Results: We present approaches for detecting insertion breakpoints and for targeted assembly of large insertions from paired HTS data: BASIL and ANISE. On near-identity repeats, which are hard for assemblers, ANISE employs a repeat resolution step. This results in far better reconstructions than those obtained by the compared methods. On simulated data, we found our insert assembler to be competitive with the de novo assemblers ABYSS and SGA while yielding inserted sequence that is already anchored, rather than the unanchored contigs produced by ABYSS/SGA. On real-world data, we detected novel sequence in a human individual and thoroughly validated the assembled sequence. ANISE was found to be superior to the competing tool MindTheGap on both simulated and real-world data.
Pages: 1904-1912
Page count: 9