Succinct data structures for assembling large genomes

被引:84
作者
Conway, Thomas C. [1 ]
Bromage, Andrew J. [1 ]
机构
[1] Univ Melbourne, Dept Comp Sci & Engn, NICTA Victoria Res Lab, Parkville, Vic 3052, Australia
基金
澳大利亚研究理事会;
关键词
D O I
10.1093/bioinformatics/btq697
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: Second-generation sequencing technology makes it feasible for many researches to obtain enough sequence reads to attempt the de novo assembly of higher eukaryotes (including mammals). De novo assembly not only provides a tool for understanding wide scale biological variation, but within human biomedicine, it offers a direct way of observing both large-scale structural variation and fine-scale sequence variation. Unfortunately, improvements in the computational feasibility for de novo assembly have not matched the improvements in the gathering of sequence data. This is for two reasons: the inherent computational complexity of the problem and the in-practice memory requirements of tools. Results: In this article, we use entropy compressed or succinct data structures to create a practical representation of the de Bruijn assembly graph, which requires at least a factor of 10 less storage than the kinds of structures used by deployed methods. Moreover, because our representation is entropy compressed, in the presence of sequencing errors it has better scaling behaviour asymptotically than conventional approaches. We present results of a proof-of-concept assembly of a human genome performed on a modest commodity server.
引用
收藏
页码:479 / 486
页数:8
相关论文
共 25 条
[1]  
[Anonymous], 2002, Compression and Coding Algorithms
[2]  
ARROYUELO D, 2010, DATA ENG ICDE 2010 I, P417
[3]   Accurate whole human genome sequencing using reversible terminator chemistry [J].
Bentley, David R. ;
Balasubramanian, Shankar ;
Swerdlow, Harold P. ;
Smith, Geoffrey P. ;
Milton, John ;
Brown, Clive G. ;
Hall, Kevin P. ;
Evers, Dirk J. ;
Barnes, Colin L. ;
Bignell, Helen R. ;
Boutell, Jonathan M. ;
Bryant, Jason ;
Carter, Richard J. ;
Cheetham, R. Keira ;
Cox, Anthony J. ;
Ellis, Darren J. ;
Flatbush, Michael R. ;
Gormley, Niall A. ;
Humphray, Sean J. ;
Irving, Leslie J. ;
Karbelashvili, Mirian S. ;
Kirk, Scott M. ;
Li, Heng ;
Liu, Xiaohai ;
Maisinger, Klaus S. ;
Murray, Lisa J. ;
Obradovic, Bojan ;
Ost, Tobias ;
Parkinson, Michael L. ;
Pratt, Mark R. ;
Rasolonjatovo, Isabelle M. J. ;
Reed, Mark T. ;
Rigatti, Roberto ;
Rodighiero, Chiara ;
Ross, Mark T. ;
Sabot, Andrea ;
Sankar, Subramanian V. ;
Scally, Aylwyn ;
Schroth, Gary P. ;
Smith, Mark E. ;
Smith, Vincent P. ;
Spiridou, Anastassia ;
Torrance, Peta E. ;
Tzonev, Svilen S. ;
Vermaas, Eric H. ;
Walter, Klaudia ;
Wu, Xiaolin ;
Zhang, Lu ;
Alam, Mohammed D. ;
Anastasi, Carole .
NATURE, 2008, 456 (7218) :53-59
[4]  
Brisaboa NR, 2009, LECT NOTES COMPUT SC, V5721, P122, DOI 10.1007/978-3-642-03784-9_12
[5]  
Claude F, 2007, LECT NOTES COMPUT SC, V4726, P118
[6]  
Claude F, 2008, LECT NOTES COMPUT SC, V5280, P176
[7]   EFFICIENT STORAGE AND RETRIEVAL BY CONTENT AND ADDRESS OF STATIC FILES [J].
ELIAS, P .
JOURNAL OF THE ACM, 1974, 21 (02) :246-260
[8]  
Fotakis D, 2003, LECT NOTES COMPUT SC, V2607, P271
[9]  
Fox S, 2009, METHODS MOL BIOL, V553, P79, DOI 10.1007/978-1-60327-563-7_5
[10]  
Good I.J., 1946, J. London Math. Soc, V21, P167