Iterative Dictionary Construction for Compression of Large DNA Data Sets

被引:55
作者
Kuruppu, Shanika [1 ]
Beresford-Smith, Bryan [2 ]
Conway, Thomas [2 ]
Zobel, Justin [1 ]
机构
[1] Univ Melbourne, Dept Comp Sci & Software Engn, Parkville, Vic 3010, Australia
[2] Univ Melbourne, Natl ICT Australia, Parkville, Vic 3010, Australia
基金
澳大利亚研究理事会;
关键词
Dictionary construction; compression; DNA; large data sets; GENOME SEQUENCE; ALGORITHM;
D O I
10.1109/TCBB.2011.82
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Genomic repositories increasingly include individual as well as reference sequences, which tend to share long identical and near-identical strings of nucleotides. However, the sequential processing used by most compression algorithms, and the volumes of data involved, mean that these long-range repetitions are not detected. An order-insensitive, disk-based dictionary construction method can detect this repeated content and use it to compress collections of sequences. We explore a dictionary construction method that improves repeat identification in large DNA data sets. Our adaptation, COMRAD, of an existing disk-based method identifies exact repeated content in collections of sequences with similarities within and across the set of input sequences. COMRAD compresses the data over multiple passes, which is an expensive process, but allows COMRAD to compress large data sets within reasonable time and space. COMRAD allows for random access to individual sequences and subsequences without decompressing the whole data set. COMRAD has no competitor in terms of the size of data sets that it can compress (extending to many hundreds of gigabytes) and, even for smaller data sets, the results are competitive compared to alternatives; as an example, 39 S. cerevisiae genomes compressed to 0.25 bits per base.
引用
收藏
页码:137 / 149
页数:13
相关论文
共 37 条
[1]   The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group [J].
Ahn, Sung-Min ;
Kim, Tae-Hyung ;
Lee, Sunghoon ;
Kim, Deokhoon ;
Ghang, Ho ;
Kim, Dae-Soo ;
Kim, Byoung-Chul ;
Kim, Sang-Yoon ;
Kim, Woo-Yeon ;
Kim, Chulhong ;
Park, Daeui ;
Lee, Yong Seok ;
Kim, Sangsoo ;
Reja, Rohit ;
Jho, Sungwoong ;
Kim, Chang Geun ;
Cha, Ji-Young ;
Kim, Kyung-Hee ;
Lee, Bonghee ;
Bhak, Jong ;
Kim, Seong-Jin .
GENOME RESEARCH, 2009, 19 (09) :1622-1629
[2]  
[Anonymous], 2011, P 34 AUSTR COMP SCI
[3]  
Apostolico A., 2000, Proceedings DCC 2000. Data Compression Conference, P143, DOI 10.1109/DCC.2000.838154
[4]  
Behzadi B, 2005, LECT NOTES COMPUT SC, V3537, P190
[5]   Accurate whole human genome sequencing using reversible terminator chemistry [J].
Bentley, David R. ;
Balasubramanian, Shankar ;
Swerdlow, Harold P. ;
Smith, Geoffrey P. ;
Milton, John ;
Brown, Clive G. ;
Hall, Kevin P. ;
Evers, Dirk J. ;
Barnes, Colin L. ;
Bignell, Helen R. ;
Boutell, Jonathan M. ;
Bryant, Jason ;
Carter, Richard J. ;
Cheetham, R. Keira ;
Cox, Anthony J. ;
Ellis, Darren J. ;
Flatbush, Michael R. ;
Gormley, Niall A. ;
Humphray, Sean J. ;
Irving, Leslie J. ;
Karbelashvili, Mirian S. ;
Kirk, Scott M. ;
Li, Heng ;
Liu, Xiaohai ;
Maisinger, Klaus S. ;
Murray, Lisa J. ;
Obradovic, Bojan ;
Ost, Tobias ;
Parkinson, Michael L. ;
Pratt, Mark R. ;
Rasolonjatovo, Isabelle M. J. ;
Reed, Mark T. ;
Rigatti, Roberto ;
Rodighiero, Chiara ;
Ross, Mark T. ;
Sabot, Andrea ;
Sankar, Subramanian V. ;
Scally, Aylwyn ;
Schroth, Gary P. ;
Smith, Mark E. ;
Smith, Vincent P. ;
Spiridou, Anastassia ;
Torrance, Peta E. ;
Tzonev, Svilen S. ;
Vermaas, Eric H. ;
Walter, Klaudia ;
Wu, Xiaolin ;
Zhang, Lu ;
Alam, Mohammed D. ;
Anastasi, Carole .
NATURE, 2008, 456 (7218) :53-59
[6]   Data structures and compression algorithms for genomic sequence data [J].
Brandon, Marty C. ;
Wallace, Douglas C. ;
Baldi, Pierre .
BIOINFORMATICS, 2009, 25 (14) :1731-1738
[7]  
Cannane A, 2001, J AM SOC INF SCI TEC, V52, P430, DOI 10.1002/1532-2890(2001)9999:9999<::AID-ASI1084>3.0.CO
[8]  
2-Z
[9]  
Cao MD, 2007, IEEE DATA COMPR CONF, P43
[10]   The smallest grammar problem [J].
Charikar, M ;
Lehman, E ;
Liu, D ;
Panigrahy, R ;
Prabhakaran, M ;
Sahai, A ;
Shelat, A .
IEEE TRANSACTIONS ON INFORMATION THEORY, 2005, 51 (07) :2554-2576