Reference-based compression of short-read sequences using path encoding

Cited by: 27
Authors
Kingsford, Carl [1]
Patro, Rob [2]
Affiliations
[1] Carnegie Mellon Univ, Dept Computat Biol, Sch Comp Sci, Pittsburgh, PA 15213 USA
[2] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
Funding
US National Science Foundation; US National Institutes of Health
Keywords
BURROWS-WHEELER TRANSFORM; GENOMIC SEQUENCE; LOSSY COMPRESSION; QUALITY SCORES; ALGORITHMS; DATABASES; ALIGNMENT; EMBRYOS; FORMAT; FASTQ;
DOI
10.1093/bioinformatics/btv071
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed.
Results: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of k-mers that is of independent interest. We are able to encode RNA-seq reads using 3-11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved.
Pages: 1920-1928
Page count: 9
Related papers
44 in total
  • [1] DNA sequence compression using the Burrows-Wheeler Transform
    Adjeroh, D
    Zhang, Y
    Mukherjee, A
    Powell, M
    Bell, T
    [J]. CSB2002: IEEE COMPUTER SOCIETY BIOINFORMATICS CONFERENCE, 2002, : 303 - 313
  • [2] No-Reference Compression of Genomic Data Stored In FASTQ Format
    Bhola, Vishal
    Bopardikar, Ajit S.
    Narayanan, Rangavittal
    Lee, Kyusang
    Ahn, TaeJin
    [J]. 2011 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM 2011), 2011, : 147 - 150
  • [3] The Scramble conversion tool
    Bonfield, James K.
    [J]. BIOINFORMATICS, 2014, 30 (19) : 2818 - 2819
  • [4] Compression of FASTQ and SAM Format Sequencing Data
    Bonfield, James K.
    Mahoney, Matthew V.
    [J]. PLOS ONE, 2013, 8 (03):
  • [5] Data structures and compression algorithms for genomic sequence data
    Brandon, Marty C.
    Wallace, Douglas C.
    Baldi, Pierre
    [J]. BIOINFORMATICS, 2009, 25 (14) : 1731 - 1738
  • [6] Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments
    Bullard, James H.
    Purdom, Elizabeth
    Hansen, Kasper D.
    Dudoit, Sandrine
    [J]. BMC BIOINFORMATICS, 2010, 11
  • [7] Fulcrum: condensing redundant reads from high-throughput sequencing studies
    Burriesci, Matthew S.
    Lehnert, Erik M.
    Pringle, John R.
    [J]. BIOINFORMATICS, 2012, 28 (10) : 1324 - 1327
  • [8] Burrows M, 1994, BLOCK SORTING LOSSLE
  • [9] Compression of Structured High-Throughput Sequencing Data
    Campagne, Fabien
    Dorff, Kevin C.
    Chambwe, Nyasha
    Robinson, James T.
    Mesirov, Jill P.
    [J]. PLOS ONE, 2013, 8 (11):
  • [10] Lossy compression of quality scores in genomic data
    Canovas, Rodrigo
    Moffat, Alistair
    Turpin, Andrew
    [J]. BIOINFORMATICS, 2014, 30 (15) : 2130 - 2136