A novel lossless encoding algorithm for data compression: genomics data as an exemplar

Authors
Al-okaily, Anas [1]
Tbakhi, Abdelghani [2]
Affiliations
[1] King Hussein Cancer Center, Department of Cell Therapy & Applied Genomics, Amman, Jordan
[2] McMaster University, Department of Pathology & Molecular Medicine, Hamilton, ON, Canada
Source
Keywords
compression; Huffman encoding; LZ; genomics; BWT; sequences; format
DOI
10.3389/fbinf.2024.1489704
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Data compression is a challenging and increasingly important problem. As the amount of data generated daily continues to increase, efficient transmission and storage have never been more critical. In this study, a novel encoding algorithm is proposed, motivated by the compression of DNA data and its associated characteristics. The proposed algorithm follows a divide-and-conquer approach: it scans the whole genome, classifies subsequences based on similarities in their content, and bins similar subsequences together. The data in each bin is then compressed independently. This approach differs from the currently known approaches, namely entropy, dictionary, predictive, and transform-based methods. Proof-of-concept performance was evaluated using a benchmark dataset of seventeen genomes ranging in size from kilobytes to gigabytes. The results showed a considerable improvement in the compression of each genome, saving several megabytes compared to state-of-the-art tools. Moreover, the algorithm can be applied to the compression of other data types, mainly text, numbers, images, audio, and video, which are being generated daily in unprecedented volumes.
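The bin-then-compress idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the fixed block length `BLOCK_SIZE`, the `classify` rule (label a block by its most frequent byte), and the use of `zlib` as the per-bin compressor are all assumptions made for illustration. The sketch only shows how subsequences might be grouped by content, compressed per bin, and restored losslessly from the stored label order.

```python
import zlib
from collections import Counter, defaultdict

BLOCK_SIZE = 64  # hypothetical block length, chosen only for illustration


def classify(block: bytes) -> int:
    """Hypothetical similarity rule: label a block by its most common byte."""
    return Counter(block).most_common(1)[0][0]


def compress_binned(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into blocks, bin blocks by label, compress each bin alone."""
    labels = []                    # label of each block, in original order
    bins = defaultdict(bytearray)  # label -> concatenated blocks
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        label = classify(block)
        labels.append(label)
        bins[label] += block
    # zlib stands in for whatever per-bin encoder is actually used.
    compressed = {label: zlib.compress(bytes(buf), 9) for label, buf in bins.items()}
    return labels, compressed


def decompress_binned(labels, compressed, block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the original byte order by replaying the label sequence."""
    bins = {label: zlib.decompress(blob) for label, blob in compressed.items()}
    offsets = dict.fromkeys(bins, 0)
    out = bytearray()
    for label in labels:
        start = offsets[label]
        # The final (possibly short) block is last in its bin, so slicing
        # past the end simply returns the remaining bytes.
        out += bins[label][start:start + block_size]
        offsets[label] = start + block_size
    return bytes(out)


if __name__ == "__main__":
    seq = b"ACGT" * 50 + b"AAAAAAAAAT" * 30 + b"GGGGCGGGGG" * 25 + b"ACG"
    labels, blobs = compress_binned(seq)
    assert decompress_binned(labels, blobs) == seq
    print(len(seq), "bytes in,", sum(len(b) for b in blobs.values()), "bytes across bins")
```

In this sketch the label sequence must be stored alongside the compressed bins, so any real saving depends on how cheaply that bookkeeping can be encoded; the sketch makes no claim about compression ratio.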
Pages: 8