Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies

Cited by: 47
Authors
Giancarlo, Raffaele [1]
Rombo, Simona E. [1]
Utro, Filippo [2]
Affiliations
[1] Univ Palermo, Dipartimento Matemat & Informat, Palermo, Italy
[2] IBM TJ Watson Res Ctr, Computat Genom Grp, Yorktown Hts, NY USA
Keywords
data compression of large sequence collections; data compression in bioinformatics; storage and management of HTS data; compressive sequence analysis; analysis of large biological sequence collections; succinct data structures for bioinformatics; TEXTUAL DATA-COMPRESSION; GENOMIC SEQUENCE; QUALITY SCORES; COMPUTATIONAL BIOLOGY; LOCAL ALIGNMENT; ALGORITHMS; FORMAT; BURROWS; CONSTRUCTION; COMPLEXITY;
DOI
10.1093/bib/bbt088
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
High-throughput sequencing technologies produce large collections of data, mainly DNA sequences with additional information, requiring the design of efficient and effective methodologies for both their compression and storage. In this context, we first provide a classification of the main techniques that have been proposed, according to three specific research directions that have emerged from the literature and, for each, we provide an overview of the current techniques. Finally, to make this review useful to researchers and technicians applying the existing software and tools, we include a synopsis of the main characteristics of the described approaches, including details on their implementation and availability. Performance of the various methods is also highlighted, although the state of the art does not lend itself to a consistent and coherent comparison among all the methods presented here.
Pages: 390-406
Number of pages: 17