No-Reference Compression of Genomic Data Stored In FASTQ Format

Cited by: 19

Authors:

Bhola, Vishal [1]

Bopardikar, Ajit S. [1]

Narayanan, Rangavittal [1]

Lee, Kyusang [2]

Ahn, TaeJin [2]

Affiliations:

[1] Samsung India Software Operat, SAIT India, Bangalore, Karnataka, India

[2] Samsung Elect Co Ltd Suwon, SAIT, Suwon, South Korea

Source:

2011 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM 2011) | 2011

Keywords:

FASTQ; Next generation sequencing; Genomic Data Compression; SEQUENCE;

DOI:

10.1109/BIBM.2011.110

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

In this paper, we propose a system to compress Next Generation Sequencing (NGS) information stored in a FASTQ file. A FASTQ file contains text, DNA read, and quality information for millions or billions of reads. The proposed system first parses the FASTQ file into its component fields. In a partial first pass, it gathers statistics, which are then used to choose the representation of each field that gives the best compression. Text data is further parsed into repeating and variable components, and entropy coding is used to compress the latter. Similarly, Markov encoding and repeat-finding methods are used for DNA read compression. Finally, we propose several run-length-based methods to encode quality data, choosing the method that gives the best performance for a given set of quality values. The compression system supports lossless and nearly lossless compression, as well as compressing only the read data or only the read and quality data. We compare its performance to the bzip2 text compression utility and an existing benchmark algorithm, and observe that the proposed system outperforms both.
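The first stage described in the abstract, splitting a FASTQ file into its component fields before coding each one separately, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`split_fastq`, `run_length_encode`, `run_length_decode`) are hypothetical, the input is assumed to use strict four-line records with no wrapped sequence lines, and the plain run-length coder stands in for the several run-length-based quality encoders that the paper proposes but the abstract does not specify.

```python
from itertools import groupby


def split_fastq(text):
    """Split FASTQ text into three parallel streams: headers, reads, qualities.

    Assumes each record occupies exactly four lines:
    '@header', read, '+' separator, quality string.
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    headers, reads, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, read, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(read) != len(qual):
            raise ValueError(f"read/quality length mismatch at line {i + 1}")
        headers.append(header[1:])  # drop the leading '@'
        reads.append(read)
        quals.append(qual)
    return headers, reads, quals


def run_length_encode(s):
    """Encode a string as (symbol, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


def run_length_decode(pairs):
    """Invert run_length_encode."""
    return "".join(ch * n for ch, n in pairs)


if __name__ == "__main__":
    sample = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nII##\n"
    headers, reads, quals = split_fastq(sample)
    for q in quals:
        encoded = run_length_encode(q)
        assert run_length_decode(encoded) == q  # lossless round trip
        print(q, "->", encoded)
```

Keeping the three streams separate is what lets a compressor pick a different model for each field, since headers, bases, and quality values have very different statistics.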


Pages: 147-150

Page count: 4

References (7 in total):

[1] Next-generation DNA sequencing techniques [J]. Ansorge, Wilhelm J. NEW BIOTECHNOLOGY, 2009, 25 (04): 195-203

[2] The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants [J]. Cock, Peter J. A.; Fields, Christopher J.; Goto, Naohisa; Heuer, Michael L.; Rice, Peter M. NUCLEIC ACIDS RESEARCH, 2010, 38 (06): 1767-1771

[3] Compression of DNA sequence reads in FASTQ format [J]. Deorowicz, Sebastian; Grabowski, Szymon. BIOINFORMATICS, 2011, 27 (06): 860-862

[4] Fritz M.H. Y., 2011, Genome Research

[5] Textual data compression in computational biology: a synopsis [J]. Giancarlo, Raffaele; Scaturro, Davide; Utro, Filippo. BIOINFORMATICS, 2009, 25 (13): 1575-1586

[6] Kaipa KK, 2010, IEEE INT C BIO BIO W, P851, DOI 10.1109/BIBMW.2010.5703941

[7] G-SQZ: compact encoding of genomic sequence and quality data [J]. Tembe, Waibhav; Lowey, James; Suh, Edward. BIOINFORMATICS, 2010, 26 (17): 2192-2194
