A Parallel Algorithm for Compression of Big Next-Generation Sequencing Datasets

Cited by: 1

Authors:

Perez, Sandino Vargas ^{[1]}

Saeed, Fahad ^{[2]}

Affiliations:

[1] Western Michigan Univ, Dept Comp Sci, Kalamazoo, MI 49008 USA

[2] Western Michigan Univ, Dept Elect & Comp Engn, Kalamazoo, MI 49008 USA

Source:

2015 IEEE TRUSTCOM/BIGDATASE/ISPA, VOL 3 | 2015

Keywords:

Next-Generation Sequencing; parallel implementation; DSRC; MPI; big data; FASTQ; FORMAT

DOI:

10.1109/Trustcom.2015.632

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

The volume of big data produced by high-throughput Next-Generation Sequencing (NGS) techniques presents various challenges, such as the storage, analysis, and transmission of massive datasets. One solution for storage and transmission is compression using specialized compression algorithms. Existing specialized algorithms scale poorly as dataset size increases, and the best available solutions can take hours to compress gigabytes of data. Compressing and decompressing petascale datasets with these techniques is prohibitively expensive in terms of time and energy. In this paper we introduce paraDSRC, a parallel implementation of the DNA Sequence Reads Compression (DSRC) application using a message-passing model that reduces the compression time complexity by a factor of O(1/p), where p is the number of processing units. Our experimental results show that paraDSRC achieves compression times that are 43% to 99% faster than DSRC and compression throughputs of up to 8.4 GB/s on a moderate-size cluster. For many of the datasets used in our experiments, super-linear speedups were registered, making the implementation strongly scalable. We also show that paraDSRC is more than 25.6x faster than comparable parallel compression algorithms.
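The general idea the abstract describes, dividing the input among p processing units so each compresses its own share, can be illustrated with a minimal sketch. This is not the paraDSRC implementation: `zlib` stands in for DSRC, a thread pool stands in for MPI ranks, chunk boundaries are plain byte offsets that ignore FASTQ record boundaries, and all function names (`chunk_ranges`, `compress_in_chunks`, `decompress_chunks`, `speedup`, `is_superlinear`) are hypothetical. The speedup helpers use the standard definition S = T_serial / T_parallel, with "super-linear" meaning S > p.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(total_bytes, p):
    """Split [0, total_bytes) into p contiguous, near-equal byte ranges."""
    base, extra = divmod(total_bytes, p)
    ranges, start = [], 0
    for rank in range(p):
        # The first `extra` workers take one additional byte each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def compress_in_chunks(data, p):
    """Compress each of the p chunks independently.

    Assumption: zlib is only a stand-in compressor, not DSRC.
    """
    parts = [data[lo:hi] for lo, hi in chunk_ranges(len(data), p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        return list(pool.map(zlib.compress, parts))


def decompress_chunks(blobs):
    """Decompress the chunks in order and concatenate them."""
    return b"".join(zlib.decompress(blob) for blob in blobs)


def speedup(t_serial, t_parallel):
    """Standard speedup: serial time divided by parallel time."""
    return t_serial / t_parallel


def is_superlinear(t_serial, t_parallel, p):
    """True when the speedup exceeds the number of processing units."""
    return speedup(t_serial, t_parallel) > p
```

For example, `chunk_ranges(10, 3)` yields the ranges `(0, 4)`, `(4, 7)`, `(7, 10)`, and with made-up timings of 100 s serial and 10 s on 8 units, `is_superlinear(100.0, 10.0, 8)` is true because the speedup of 10 exceeds 8. These numbers are illustrative only and are not taken from the paper.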

References

Pages: 196-201

Page count: 6

20 entries in total

[1] [Anonymous], INTRO PARALLEL COMPU
[2] Benz JK, 2009, PROCEEDINGS OF THE 4TH INTERNATIONAL TOPICAL MEETING ON HIGH TEMPERATURE REACTOR TECHNOLOGY - 2008, VOL 2, P91
[3] Ching, A; Choudhary, A; Coloma, K; Liao, WK; Ross, R; Gropp, W. Noncontiguous I/O accesses through MPI-IO [J]. CCGRID 2003: 3RD IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER COMPUTING AND THE GRID, PROCEEDINGS, 2003: 104-111
[4] Cock, Peter J. A.; Fields, Christopher J.; Goto, Naohisa; Heuer, Michael L.; Rice, Peter M. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants [J]. NUCLEIC ACIDS RESEARCH, 2010, 38(6): 1767-1771
[5] Deorowicz, Sebastian; Grabowski, Szymon. Compression of DNA sequence reads in FASTQ format [J]. BIOINFORMATICS, 2011, 27(6): 860-862
[6] Dickens PM, 2009, HPDC'09: 18TH ACM INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, P31
[7] Grama A. Y., 1993, IEEE Parallel & Distributed Technology: Systems & Applications, V1, P12, DOI 10.1109/88.242438
[8] Grassi, Elena; Di Gregorio, Federico; Molineris, Ivan. KungFQ: A Simple and Powerful Approach to Compress fastq Files [J]. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2012, 9(6): 1837-1842
[9] Howison, Mark. High-Throughput Compression of FASTQ Data with SeqDB [J]. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2013, 10(1): 213-218
[10] Jeon, Young Jun; Park, Sang Hyun; Ahn, Sung Min; Hwang, Hee Joung. SOLiDzipper: A High Speed Encoding Method for the Next-Generation Sequencing Data [J]. EVOLUTIONARY BIOINFORMATICS, 2011, 7: 1-6
