Bitpacking techniques for indexing genomes: I. Hash tables

被引:4
作者
Wu, Thomas D. [1 ]
机构
[1] Genentech Inc, Dept Bioinformat & Computat Biol, 1 DNA Way, San Francisco, CA 94080 USA
关键词
Hash table; Sequence alignment; Genomics; Data compression; COMPRESSION ALGORITHMS; SPEED-UP; GMAP;
D O I
10.1186/s13015-016-0069-5
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Background: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome. Results: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values. Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.
引用
收藏
页数:13
相关论文
共 33 条
[1]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[2]  
[Anonymous], 2009, Proceedings of the VLDB Endowment, DOI DOI 10.14778/1687627.1687671
[3]   Data structures and compression algorithms for genomic sequence data [J].
Brandon, Marty C. ;
Wallace, Douglas C. ;
Baldi, Pierre .
BIOINFORMATICS, 2009, 25 (14) :1731-1738
[4]   Data structures and compression algorithms for high-throughput sequencing technologies [J].
Daily, Kenny ;
Rigor, Paul ;
Christley, Scott ;
Xie, Xiaohui ;
Baldi, Pierre .
BMC BIOINFORMATICS, 2010, 11
[5]   Data compression for sequencing data [J].
Deorowicz, Sebastian ;
Grabowski, Szymon .
ALGORITHMS FOR MOLECULAR BIOLOGY, 2013, 8
[6]   Accelerated Profile HMM Searches [J].
Eddy, Sean R. .
PLOS COMPUTATIONAL BIOLOGY, 2011, 7 (10)
[7]  
ELIAS P, 1975, IEEE T INFORM THEORY, V21, P194, DOI 10.1109/TIT.1975.1055349
[8]   Striped Smith-Waterman speeds database searches six times over other SIMD implementations [J].
Farrar, Michael .
BIOINFORMATICS, 2007, 23 (02) :156-161
[9]   Robust universal complete codes for transmission and compression [J].
Fraenkel, AS ;
Klein, ST .
DISCRETE APPLIED MATHEMATICS, 1996, 64 (01) :31-55
[10]   FusionMap: detecting fusion genes from next-generation sequencing data at base-pair resolution [J].
Ge, Huanying ;
Liu, Kejun ;
Juan, Todd ;
Fang, Fang ;
Newman, Matthew ;
Hoeck, Wolfgang .
BIOINFORMATICS, 2011, 27 (14) :1922-1928