FRESCO: Referential Compression of Highly Similar Sequences

被引:55
作者
Wandelt, Sebastian [1 ]
Leser, Ulf [1 ]
机构
[1] Humboldt Univ, Knowledge Management Bioinformat Grp, D-10099 Berlin, Germany
关键词
Sequences; referential compression; second-order compression; compression heuristics; GENOMES; ALGORITHM;
D O I
10.1109/TCBB.2013.122
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
In many applications, sets of similar texts or sequences are of high importance. Prominent examples are revision histories of documents or genomic sequences. Modern high-throughput sequencing technologies are able to generate DNA sequences at an ever-increasing rate. In parallel to the decreasing experimental time and cost necessary to produce DNA sequences, computational requirements for analysis and storage of the sequences are steeply increasing. Compression is a key technology to deal with this challenge. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, gained a lot of interest in this field. In this paper, we propose a general open-source framework to compress large amounts of biological sequence data called Framework for REferential Sequence COmpression (FRESCO). Our basic compression algorithm is shown to be one to two orders of magnitudes faster than comparable related work, while achieving similar compression ratios. We also propose several techniques to further increase compression ratios, while still retaining the advantage in speed: 1) selecting a good reference sequence; and 2) rewriting a reference sequence to allow for better compression. In addition, we propose a new way of further boosting the compression ratios by applying referential compression to already referentially compressed files (second-order compression). This technique allows for compression ratios way beyond state of the art, for instance, 4,000: 1 and higher for human genomes. We evaluate our algorithms on a large data set from three different species (more than 1,000 genomes, more than 3 TB) and on a collection of versions of Wikipedia pages. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.
引用
收藏
页码:1275 / 1288
页数:14
相关论文
共 45 条
[31]   Data Compression Concepts and Algorithms and Their Applications to Bioinformatics [J].
Nalbantoglu, Oezkan U. ;
Russell, David J. ;
Sayood, Khalid .
ENTROPY, 2010, 12 (01) :34-52
[32]  
Ohlebusch E, 2010, LECT NOTES COMPUT SC, V6393, P322, DOI 10.1007/978-3-642-16321-0_34
[33]   Will Computers Crash Genomics? [J].
Pennisi, Elizabeth .
SCIENCE, 2011, 331 (6018) :666-668
[34]   GReEn: a tool for efficient compression of genome resequencing data [J].
Pinho, Armando J. ;
Pratas, Diogo ;
Garcia, Sara P. .
NUCLEIC ACIDS RESEARCH, 2012, 40 (04) :e27
[35]  
Pratas D, 2011, ADV INTEL SOFT COMPU, V93, P213
[36]   A window into third-generation sequencing [J].
Schadt, Eric E. ;
Turner, Steve ;
Kasarskis, Andrew .
HUMAN MOLECULAR GENETICS, 2010, 19 :R227-R240
[37]   Cloud computing and the DNA data race [J].
Schatz, Michael C. ;
Langmead, Ben ;
Salzberg, Steven L. .
NATURE BIOTECHNOLOGY, 2010, 28 (07) :691-693
[38]  
Shibata Y, 2000, LECT NOTES COMPUT SC, V1848, P181
[39]   The case for cloud computing in genome informatics [J].
Stein, Lincoln D. .
GENOME BIOLOGY, 2010, 11 (05)
[40]   Big data, but are we ready? [J].
Trelles, Oswaldo ;
Prins, Pjotr ;
Snir, Marc ;
Jansen, Ritsert C. .
NATURE REVIEWS GENETICS, 2011, 12 (03) :224-224