A Parallel Algorithm for Compression of Big Next-Generation Sequencing Datasets

Cited by: 1
Authors
Perez, Sandino Vargas [1]
Saeed, Fahad [2]
Affiliations
[1] Western Michigan Univ, Dept Comp Sci, Kalamazoo, MI 49008 USA
[2] Western Michigan Univ, Dept Elect & Comp Engn, Kalamazoo, MI 49008 USA
Source
2015 IEEE TRUSTCOM/BIGDATASE/ISPA, VOL 3 | 2015
Keywords
Next-Generation Sequencing; parallel implementation; DSRC; MPI; big data; FASTQ; FORMAT
DOI
10.1109/Trustcom.2015.632
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The volume of data produced by high-throughput Next-Generation Sequencing (NGS) techniques presents challenges in the storage, analysis, and transmission of massive datasets. One solution for storage and transmission is compression with specialized algorithms. Existing specialized algorithms scale poorly as dataset size grows, and the best available solutions can take hours to compress gigabytes of data. Compression and decompression with these techniques for petascale datasets is prohibitively expensive in time and energy. In this paper we introduce paraDSRC, a parallel implementation of the DNA Sequence Reads Compression (DSRC) application using a message-passing model that reduces the compression time complexity by a factor of O(1/p), where p is the number of processing units. Our experimental results show that paraDSRC achieves compression times that are 43% to 99% faster than DSRC and compression throughputs of up to 8.4 GB/s on a moderate-size cluster. For many of the datasets used in our experiments, super-linear speedups were registered, making the implementation strongly scalable. We also show that paraDSRC is more than 25.6x faster than comparable parallel compression algorithms.
Pages: 196-201
Number of pages: 6
Related papers
20 in total
  • [1] [Anonymous], INTRO PARALLEL COMPU
  • [2] Benz JK, 2009, PROCEEDINGS OF THE 4TH INTERNATIONAL TOPICAL MEETING ON HIGH TEMPERATURE REACTOR TECHNOLOGY - 2008, VOL 2, P91
  • [3] Noncontiguous I/O accesses through MPI-IO
    Ching, A
    Choudhary, A
    Coloma, K
    Liao, WK
    Ross, R
    Gropp, W
    [J]. CCGRID 2003: 3RD IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER COMPUTING AND THE GRID, PROCEEDINGS, 2003, : 104 - 111
  • [4] The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants
    Cock, Peter J. A.
    Fields, Christopher J.
    Goto, Naohisa
    Heuer, Michael L.
    Rice, Peter M.
    [J]. NUCLEIC ACIDS RESEARCH, 2010, 38 (06) : 1767 - 1771
  • [5] Compression of DNA sequence reads in FASTQ format
    Deorowicz, Sebastian
    Grabowski, Szymon
    [J]. BIOINFORMATICS, 2011, 27 (06) : 860 - 862
  • [6] Dickens PM, 2009, HPDC'09: 18TH ACM INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, P31
  • [7] Grama A. Y., 1993, IEEE Parallel & Distributed Technology: Systems & Applications, V1, P12, DOI 10.1109/88.242438
  • [8] KungFQ: A Simple and Powerful Approach to Compress fastq Files
    Grassi, Elena
    Di Gregorio, Federico
    Molineris, Ivan
    [J]. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2012, 9 (06) : 1837 - 1842
  • [9] High-Throughput Compression of FASTQ Data with SeqDB
    Howison, Mark
    [J]. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2013, 10 (01) : 213 - 218
  • [10] SOLiDzipper: A High Speed Encoding Method for the Next-Generation Sequencing Data
    Jeon, Young Jun
    Park, Sang Hyun
    Ahn, Sung Min
    Hwang, Hee Joung
    [J]. EVOLUTIONARY BIOINFORMATICS, 2011, 7 : 1 - 6