Crumble: reference free lossy compression of sequence quality values

Cited by: 18
Authors
Bonfield, James K. [1]
McCarthy, Shane A. [1, 2]
Durbin, Richard [1, 2]
Affiliations
[1] DNA Pipelines, Wellcome Sanger Inst, Wellcome Genome Campus, Hinxton CB10 1SA, England
[2] Univ Cambridge, Dept Genet, Cambridge CB2 3EH, England
Funding
Wellcome Trust (UK)
DOI
10.1093/bioinformatics/bty608
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: The bulk of the space taken up by NGS sequencing CRAM files consists of per-base quality values. Most of these are unnecessary for variant calling, offering an opportunity for space saving.
Results: On the Syndip test set, a 17-fold reduction in the quality storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2- to 7.4-fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details).
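The dependence of the whole-file saving on the non-quality content can be illustrated with a little arithmetic. The sketch below is not taken from the paper. It assumes that only the quality portion shrinks, that it shrinks by a single uniform factor, and that all other CRAM content stays the same size. The function names `overall_fold` and `implied_quality_fraction` are hypothetical.

```python
def overall_fold(quality_fraction: float, quality_fold: float) -> float:
    """Whole-file fold reduction when only the quality portion shrinks.

    quality_fraction: share of the original file occupied by quality values (0..1).
    quality_fold: fold reduction applied to that share (assumed uniform).
    """
    if not 0.0 <= quality_fraction <= 1.0:
        raise ValueError("quality_fraction must be between 0 and 1")
    if quality_fold <= 0.0:
        raise ValueError("quality_fold must be positive")
    # New relative size = untouched non-quality part + shrunken quality part.
    remaining = (1.0 - quality_fraction) + quality_fraction / quality_fold
    return 1.0 / remaining


def implied_quality_fraction(total_fold: float, quality_fold: float) -> float:
    """Invert overall_fold: the quality share that would yield total_fold."""
    if total_fold <= 0.0:
        raise ValueError("total_fold must be positive")
    if quality_fold <= 1.0:
        raise ValueError("quality_fold must exceed 1 for the inverse to exist")
    # From 1/total = 1 - f * (1 - 1/q), solve for f.
    return (1.0 - 1.0 / total_fold) / (1.0 - 1.0 / quality_fold)


if __name__ == "__main__":
    # Illustrative quality shares only; these are not measurements from the paper.
    for share in (0.5, 0.7, 0.9):
        print(f"quality share {share:.0%}: whole-file fold {overall_fold(share, 17.0):.2f}")
```

Under these simplifying assumptions, a file in which quality values make up a larger share of the bytes sees a larger overall reduction, which is consistent with the range of whole-file savings reported in the abstract.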
Pages: 337-339
Page count: 3