SOLiDzipper: A High Speed Encoding Method for the Next-Generation Sequencing Data

Cited by: 6
Authors
Jeon, Young Jun [2]
Park, Sang Hyun [2]
Ahn, Sung Min [1]
Hwang, Hee Joung [2]
Affiliations
[1] Gachon Univ Med & Sci, Lee Gil Ya Canc & Diabet Inst, Lab Genom & Genom Med, Inchon, South Korea
[2] Gachon Univ Med & Sci, SDLAB, Inchon, South Korea
Source
EVOLUTIONARY BIOINFORMATICS | 2011 / Vol. 7
Keywords
bioinformatics; NGS; DNA compression; cloud computing;
DOI
10.4137/EBO.S6618
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Background: Next-generation sequencing (NGS) methods pose the computational challenge of handling large volumes of data. Although cloud computing offers a potential solution, transferring a large data set across the internet is the biggest obstacle, one that efficient encoding methods may overcome. When encoding is used to facilitate data transfer to the cloud, encoding time is as important as encoding efficiency. Moreover, to take advantage of parallel processing in the cloud, a parallel technique for decoding and splitting compressed data in the cloud is essential. Hence, in this review, we present SOLiDzipper, a new encoding method for NGS data. Methods: The basic strategy of SOLiDzipper is to divide and encode. NGS data files contain both sequence and non-sequence information, which differ in encoding efficiency. In SOLiDzipper, encoded data are stored in a binary data block that does not contain the characteristic information of a specific sequencing platform, which means that the data can be decoded for any desired platform, even in the case of Illumina, Solexa or Roche 454 data. Results: The main calculation time using Crossbow was 173 minutes when 40 EC2 nodes were involved. In that case, an analysis preparation time of 464 minutes is required to encode the data with one of the latest DNA compression methods, such as G-SQZ, and transmit them over a 183 Mbit/s connection. By contrast, it takes 194 minutes to encode and transmit the data with SOLiDzipper under the same bandwidth conditions. These results indicate that, at a fixed network bandwidth, the total processing time can be reduced through the choice of encoding method. Given limited network bandwidth, high-speed, high-efficiency encoding methods such as SOLiDzipper can contribute significantly to productivity in labs seeking to use the cloud as an alternative to local computing.
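The timing figures in the Results can be put side by side with a little arithmetic. The sketch below is illustrative only: it assumes that preparation (encode plus transmit) and the Crossbow computation run strictly one after the other, which the abstract does not state, and the constant and function names are hypothetical.

```python
# Figures quoted in the abstract, all in minutes.
CROSSBOW_COMPUTE_MIN = 173    # main calculation on 40 EC2 nodes
PREP_GSQZ_MIN = 464           # encode with G-SQZ and transmit at 183 Mbit/s
PREP_SOLIDZIPPER_MIN = 194    # encode with SOLiDzipper and transmit, same bandwidth


def total_minutes(prep_min: int, compute_min: int = CROSSBOW_COMPUTE_MIN) -> int:
    """End-to-end time, ASSUMING preparation and computation do not overlap."""
    return prep_min + compute_min


total_gsqz = total_minutes(PREP_GSQZ_MIN)
total_solidzipper = total_minutes(PREP_SOLIDZIPPER_MIN)
saved = total_gsqz - total_solidzipper
saved_fraction = saved / total_gsqz

print(f"G-SQZ pipeline:       {total_gsqz} min")
print(f"SOLiDzipper pipeline: {total_solidzipper} min")
print(f"Saved:                {saved} min ({saved_fraction:.0%} of the G-SQZ total)")
```

Under that assumption the preparation step, not the computation, dominates the G-SQZ pipeline, which is consistent with the abstract's emphasis on encoding speed.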
Pages: 1-6
Number of pages: 6
Related papers
12 in total
  • [1] ADJEROH, 2002, P IEEE COMP SOC BIOI
  • [2] The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group
    Ahn, Sung-Min
    Kim, Tae-Hyung
    Lee, Sunghoon
    Kim, Deokhoon
    Ghang, Ho
    Kim, Dae-Soo
    Kim, Byoung-Chul
    Kim, Sang-Yoon
    Kim, Woo-Yeon
    Kim, Chulhong
    Park, Daeui
    Lee, Yong Seok
    Kim, Sangsoo
    Reja, Rohit
    Jho, Sungwoong
    Kim, Chang Geun
    Cha, Ji-Young
    Kim, Kyung-Hee
    Lee, Bonghee
    Bhak, Jong
    Kim, Seong-Jin
    [J]. GENOME RESEARCH, 2009, 19 (09) : 1622 - 1629
  • [3] Data structures and compression algorithms for genomic sequence data
    Brandon, Marty C.
    Wallace, Douglas C.
    Baldi, Pierre
    [J]. BIOINFORMATICS, 2009, 25 (14) : 1731 - 1738
  • [4] DNACompress: fast and effective DNA sequence compression
    Chen, X
    Li, M
    Ma, B
    Tromp, J
    [J]. BIOINFORMATICS, 2002, 18 (12) : 1696 - 1698
  • [5] A METHOD FOR THE CONSTRUCTION OF MINIMUM-REDUNDANCY CODES
    HUFFMAN, DA
    [J]. PROCEEDINGS OF THE INSTITUTE OF RADIO ENGINEERS, 1952, 40 (09) : 1098 - 1101
  • [6] Cost-Effective Cloud Computing: A Case Study Using the Comparative Genomics Tool, Roundup
    Kudtarkar, Parul
    DeLuca, Todd F.
    Fusaro, Vincent A.
    Tonellato, Peter J.
    Wall, Dennis P.
    [J]. EVOLUTIONARY BIOINFORMATICS, 2010, 6 : 197 - 203
  • [7] Searching for SNPs with cloud computing
    Langmead, Ben
    Schatz, Michael C.
    Lin, Jimmy
    Pop, Mihai
    Salzberg, Steven L.
    [J]. GENOME BIOLOGY, 2009, 10 (11):
  • [8] The Sequence Alignment/Map format and SAMtools
    Li, Heng
    Handsaker, Bob
    Wysoker, Alec
    Fennell, Tim
    Ruan, Jue
    Homer, Nils
    Marth, Gabor
    Abecasis, Goncalo
    Durbin, Richard
    [J]. BIOINFORMATICS, 2009, 25 (16) : 2078 - 2079
  • [9] Applications of Next-Generation Sequencing: Sequencing technologies - the next generation
    Metzker, Michael L.
    [J]. NATURE REVIEWS GENETICS, 2010, 11 (01) : 31 - 46
  • [10] SOLIMAN, 2009, INT J BIOINFORM RES, V5, P593