No-Reference Compression of Genomic Data Stored In FASTQ Format

Cited by: 19
Authors
Bhola, Vishal [1]
Bopardikar, Ajit S. [1]
Narayanan, Rangavittal [1]
Lee, Kyusang [2]
Ahn, TaeJin [2]
Affiliations
[1] Samsung India Software Operat, SAIT India, Bangalore, Karnataka, India
[2] Samsung Elect Co Ltd Suwon, SAIT, Suwon, South Korea
Keywords
FASTQ; Next generation sequencing; Genomic Data Compression; SEQUENCE;
DOI
10.1109/BIBM.2011.110
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
In this paper, we propose a system to compress Next Generation Sequencing (NGS) information stored in a FASTQ file. A FASTQ file contains text, DNA read, and quality information for millions or billions of reads. The proposed system first parses the FASTQ file into its component fields. In a partial first pass, it gathers statistics, which are then used to choose the representation for each field that gives the best compression. Text data is further parsed into repeating and variable components, and entropy coding is used to compress the latter. Similarly, Markov encoding and repeat-finding-based methods are used for DNA read compression. Finally, we propose several run-length-based methods to encode quality data, choosing the method that gives the best performance for a given set of quality values. The compression system provides features for lossless and nearly lossless compression, as well as for compressing only read data or read + quality data. We compare its performance to that of the bzip2 text compression utility and an existing benchmark algorithm, and observe that the proposed system outperforms both.
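The field-splitting and run-length ideas mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`split_fastq_fields`, `run_length_encode`, `run_length_decode`) and the sample data are hypothetical, and the sketch assumes the common four-line FASTQ record layout (header, read, `+` separator, quality). It only shows why separating fields helps: quality strings often contain long runs of the same symbol, which a run-length representation stores compactly.

```python
from itertools import groupby


def split_fastq_fields(text):
    """Split FASTQ text into three parallel streams: headers, reads, qualities.

    Assumes exactly four lines per record (no multi-line sequences).
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    headers, reads, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, read, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        headers.append(header[1:])  # drop the leading '@'
        reads.append(read)
        quals.append(qual)
    return headers, reads, quals


def run_length_encode(s):
    """Encode a string as a list of (symbol, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def run_length_decode(pairs):
    """Invert run_length_encode."""
    return "".join(ch * n for ch, n in pairs)


# Hypothetical two-record sample, for illustration only.
SAMPLE = "@read1\nACGTACGT\n+\nIIIIHHHH\n@read2\nGGGGCCCC\n+\nIIIIIIII\n"

headers, reads, quals = split_fastq_fields(SAMPLE)
encoded = [run_length_encode(q) for q in quals]
decoded = [run_length_decode(e) for e in encoded]
```

Here each quality string collapses to one or two (symbol, count) pairs, and decoding restores the original strings exactly, so this particular representation is lossless. A fuller system, as the abstract describes, would choose among several such representations per field based on gathered statistics.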
Pages: 147-150
Page count: 4
Related papers
50 in total
  • [11] FCompress: An Algorithm for FASTQ Sequence Data Compression
    Sardaraz, Muhammad
    Tahir, Muhammad
    CURRENT BIOINFORMATICS, 2019, 14 (02) : 123 - 129
  • [12] High-Throughput Compression of FASTQ Data with SeqDB
    Howison, Mark
    IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2013, 10 (01) : 213 - 218
  • [13] Lossless and reference-free compression of FASTQ/A files using GeneSqueeze
    Nazari, Foad
    Patel, Sneh
    Larocca, Melissa
    Sansevich, Alina
    Czarny, Ryan
    Schena, Giana
    Murray, Emma K.
    SCIENTIFIC REPORTS, 2025, 15 (01):
  • [14] FPGA Acceleration of Reference-Based Compression for Genomic Data
    Arram, James
    Pflanzer, Moritz
    Kaplan, Thomas
    Luk, Wayne
    2015 INTERNATIONAL CONFERENCE ON FIELD PROGRAMMABLE TECHNOLOGY (FPT), 2015, : 9 - 16
  • [15] A NO-REFERENCE VIDEO QUALITY PREDICTOR FOR COMPRESSION AND SCALING ARTIFACTS
    Ghadiyaram, Deepti
    Chen, Chao
    Inguva, Sasi
    Kokaram, Anil
    2017 24TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2017, : 3445 - 3449
  • [16] LW-FQZip 2: a parallelized reference-based compression of FASTQ files
    Huang, Zhi-An
    Wen, Zhenkun
    Deng, Qingjin
    Chu, Ying
    Sun, Yiwen
    Zhu, Zexuan
    BMC BIOINFORMATICS, 2017, 18
  • [17] LW-FQZip 2: a parallelized reference-based compression of FASTQ files
    Zhi-An Huang
    Zhenkun Wen
    Qingjin Deng
    Ying Chu
    Yiwen Sun
    Zexuan Zhu
    BMC Bioinformatics, 18
  • [18] Genomic Data Compression
    Hernaez, Mikel
    Pavlichin, Dmitri
    Weissman, Tsachy
    Ochoa, Idoia
    ANNUAL REVIEW OF BIOMEDICAL DATA SCIENCE, VOL 2, 2019, 2019, 2 : 19 - 37
  • [19] Engineering a differencing and compression data format
    Korn, DG
    Vo, KP
    USENIX ASSOCIATION PROCEEDINGS OF THE GENERAL TRACK, 2002, : 219 - 228
  • [20] FORMAT BASED DATA-COMPRESSION
    LYONS, NR
    DATA BASE, 1983, 14 (02): : 15 - 18