The Metagenomic Binning Problem: Clustering Markov Sequences

被引:0
作者
Greenberg, Grant [1 ]
Shomorony, Ilan [1 ]
机构
[1] Univ Illinois, Dept Elect & Comp Engn, Urbana, IL 61801 USA
来源
IEEE TRANSACTIONS ON MOLECULAR BIOLOGICAL AND MULTI-SCALE COMMUNICATIONS | 2024年 / 10卷 / 01期
基金
美国国家科学基金会;
关键词
Biological information theory; metagenomics; clustering algorithms;
D O I
10.1109/TMBMC.2023.3336254
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
The goal of metagenomics is to study the composition of microbial communities, typically using high -throughput shotgun sequencing. In the metagenomic binning problem, we observe random substrings (called contigs) from a mixture of genomes and aim to cluster them according to their genome of origin. Based on the empirical observation that genomes of different bacterial species can be distinguished based on their tetranucleotide frequencies, we model this task as the problem of clustering N sequences generated by M distinct Markov processes, where M << N. Utilizing the large -deviation principle for Markov processes, we establish the information -theoretic limit for perfect binning. Specifically, we show that the length of the contigs must scale with the inverse of the Chernoff divergence rate between the two most similar species. Furthermore, our result implies that contigs should be binned using the KL divergence rate as a measure of distance, as opposed to the Euclidean distance often used in practice.
引用
收藏
页码:32 / 42
页数:11
相关论文
共 36 条
  • [1] Exact Recovery in the Stochastic Block Model
    Abbe, Emmanuel
    Bandeira, Afonso S.
    Hall, Georgina
    [J]. IEEE TRANSACTIONS ON INFORMATION THEORY, 2016, 62 (01) : 471 - 487
  • [2] Akaike H, 1998, Springer series in statistics, P199, DOI [10.1007/978-1-4612-1694-015, DOI 10.1007/978-1-4612-1694-015]
  • [3] Investigations of oligonucleotide usage variance within and between prokaryotes
    Bohlin, Jon
    Skjerve, Eystein
    Ussery, David W.
    [J]. PLOS COMPUTATIONAL BIOLOGY, 2008, 4 (04)
  • [4] PhymmBL expanded: confidence scores, custom databases, parallelization and more
    Brady, Arthur
    Salzberg, Steven
    [J]. NATURE METHODS, 2011, 8 (05) : 367 - 367
  • [5] A review of methods and databases for metagenomic classification and assembly
    Breitwieser, Florian P.
    Lu, Jennifer
    Salzberg, Steven L.
    [J]. BRIEFINGS IN BIOINFORMATICS, 2019, 20 (04) : 1125 - 1139
  • [6] Bioinformatics for whole-genome shotgun sequencing of microbial communities
    Chen, K
    Pachter, L
    [J]. PLOS COMPUTATIONAL BIOLOGY, 2005, 1 (02) : 106 - 112
  • [7] Cover TM., 2005, Elements of Information Theory, DOI [10.1002/047174882x, DOI 10.1002/047174882X]
  • [8] Computational approaches to predict bacteriophage-host relationships
    Edwards, Robert A.
    McNair, Katelyn
    Faust, Karoline
    Raes, Jeroen
    Dutilh, Bas E.
    [J]. FEMS MICROBIOLOGY REVIEWS, 2016, 40 (02) : 258 - 272
  • [9] CAMISIM: simulating metagenomes and microbial communities
    Fritz, Adrian
    Hofmann, Peter
    Majda, Stephan
    Dahms, Eik
    Droege, Johannes
    Fiedler, Jessika
    Lesker, Till R.
    Belmann, Peter
    DeMaere, Matthew Z.
    Darling, Aaron E.
    Sczyrba, Alexander
    Bremges, Andreas
    McHardy, Alice C.
    [J]. MICROBIOME, 2019, 7 (1)
  • [10] ASYMPTOTICALLY OPTIMAL CLASSIFICATION FOR MULTIPLE TESTS WITH EMPIRICALLY OBSERVED STATISTICS
    GUTMAN, M
    [J]. IEEE TRANSACTIONS ON INFORMATION THEORY, 1989, 35 (02) : 401 - 408