Readjoiner: a fast and memory efficient string graph-based sequence assembler

被引:40
作者
Gonnella, Giorgio [1 ]
Kurtz, Stefan [1 ]
机构
[1] Univ Hamburg, Ctr Bioinformat, D-20146 Hamburg, Germany
关键词
LARGE GENOMES;
D O I
10.1186/1471-2105-13-82
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Background: Ongoing improvements in throughput of the next-generation sequencing technologies challenge the current generation of de novo sequence assemblers. Most recent sequence assemblers are based on the construction of a de Bruijn graph. An alternative framework of growing interest is the assembly string graph, not necessitating a division of the reads into k-mers, but requiring fast algorithms for the computation of suffix-prefix matches among all pairs of reads. Results: Here we present efficient methods for the construction of a string graph from a set of sequencing reads. Our approach employs suffix sorting and scanning methods to compute suffix-prefix matches. Transitive edges are recognized and eliminated early in the process and the graph is efficiently constructed including irreducible edges only. Conclusions: Our suffix-prefix match determination and string graph construction algorithms have been implemented in the software package Readjoiner. Comparison with existing string graph-based assemblers shows that Readjoiner is faster and more space efficient. Readjoiner is available at http://www.zbh.uni-hamburg.de/readjoiner.
引用
收藏
页数:19
相关论文
共 30 条
[11]  
Earl D.A., 2011, Genome Research
[12]   The string B-tree: A new data structure for string search in external memory and its applications [J].
Ferragina, P ;
Grossi, R .
JOURNAL OF THE ACM, 1999, 46 (02) :236-280
[13]   TRIE MEMORY [J].
FREDKIN, E .
COMMUNICATIONS OF THE ACM, 1960, 3 (09) :490-499
[14]   AN EFFICIENT ALGORITHM FOR THE ALL PAIRS SUFFIX PREFIX PROBLEM [J].
GUSFIELD, D ;
LANDAU, GM ;
SCHIEBER, B .
INFORMATION PROCESSING LETTERS, 1992, 41 (04) :181-185
[15]   De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer [J].
Hernandez, David ;
Francois, Patrice ;
Farinelli, Laurent ;
Osteras, Magne ;
Schrenzel, Jacques .
GENOME RESEARCH, 2008, 18 (05) :802-809
[16]  
Kärkkäinen J, 2008, LECT NOTES COMPUT SC, V5280, P3
[17]   Quake: quality-aware detection and correction of sequencing errors [J].
Kelley, David R. ;
Schatz, Michael C. ;
Salzberg, Steven L. .
GENOME BIOLOGY, 2010, 11 (11)
[18]   Versatile and open software for comparing large genomes [J].
Kurtz, S ;
Phillippy, A ;
Delcher, AL ;
Smoot, M ;
Shumway, M ;
Antonescu, C ;
Salzberg, SL .
GENOME BIOLOGY, 2004, 5 (02)
[19]   SUFFIX ARRAYS - A NEW METHOD FOR ONLINE STRING SEARCHES [J].
MANBER, U ;
MYERS, G .
SIAM JOURNAL ON COMPUTING, 1993, 22 (05) :935-948
[20]  
McPherson JD, 2009, NAT METHODS, V6, pS2, DOI [10.1038/nmeth.f.268, 10.1038/NMETH.F.268]