HiCMC: High-Efficiency Contact Matrix Compressor

Cited by: 0
Authors
Adhisantoso, Yeremia Gunawan [1, 2]
Koerner, Tim [1, 2]
Muentefering, Fabian [1, 2]
Ostermann, Joern [1, 2]
Voges, Jan [3, 4]
Affiliations
[1] Leibniz Univ Hannover, Inst Informationsverarbeitung, Hannover, Germany
[2] Leibniz Univ Hannover, Res Ctr L3S, Hannover, Germany
[3] CIMA Univ Navarra, Pamplona, Spain
[4] IdiSNA, Pamplona, Spain
Source
BMC Bioinformatics | 2024 / Volume 25 / Issue 1
Keywords
Contact matrix; Hi-C; 3C; Compression; GENE-REGULATION; PRINCIPLES; FORMAT; MAPS;
DOI
10.1186/s12859-024-05907-2
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Chromosome organization plays an important role in biological processes such as replication, regulation, and transcription. One way to study the relationship between chromosome structure and its biological functions is through Hi-C studies, a genome-wide method for capturing chromosome conformation. Such studies generate vast amounts of data. The problem is exacerbated by the fact that chromosome organization is dynamic, so snapshots must be taken at different points in time, which further increases the amount of data to be stored. We present a novel approach called the High-Efficiency Contact Matrix Compressor (HiCMC) for efficient compression of Hi-C data.
Results: By modeling the underlying structures found in the contact matrix, such as compartments and domains, HiCMC outperforms the state-of-the-art method CMC by approximately 8% and the other state-of-the-art methods cooler, LZMA, and bzip2 by over 50% across multiple cell lines and contact matrix resolutions. In addition, HiCMC integrates domain-specific information into the compressed bitstreams that it generates, and this information can be used to speed up downstream analyses.
Conclusion: HiCMC is a novel compression approach that exploits intrinsic properties of the contact matrix, such as compartments and domains. It achieves better compression than the state-of-the-art methods. HiCMC is available at https://github.com/sXperfect/hicmc.
Pages: 15