CH-Bin: A convex hull based approach for binning metagenomic contigs

Cited by: 3
Authors
Chandrasiri, Sunera [1]
Perera, Thumula [1]
Dilhara, Anjala [1]
Perera, Indika [1]
Mallawaarachchi, Vijini [2, 3]
Affiliations
[1] Univ Moratuwa, Dept Comp Sci & Engn, Moratuwa 10400, Sri Lanka
[2] Australian Natl Univ, Sch Comp, Canberra, ACT 2600, Australia
[3] Flinders Univ S Australia, Flinders Accelerator Microbiome Explorat, Bedford Pk, SA 5042, Australia
Funding
US National Institutes of Health (NIH)
Keywords
Convex hull; Convex hull distance; Metagenomic binning; Multiple k values; High dimensional data clustering; Clustering algorithm; CLASSIFICATION; SEQUENCES; GENOMES; ALGORITHM; COVERAGE
DOI
10.1016/j.compbiolchem.2022.107734
Chinese Library Classification number
Q [Biological sciences]
Discipline classification code
07; 0710; 09
Abstract
Metagenomics has enabled culture-independent analysis of micro-organisms present in environmental samples. Metagenomic binning, which groups contigs into bins that represent different taxonomic groups, is an important step that follows assembly in a typical metagenomic workflow. Most metagenomic binning tools represent the composition and coverage information of contigs as feature vectors with a large number of dimensions. However, these tools use traditional Euclidean or Manhattan distance metrics, which become unreliable in high dimensional spaces. We propose CH-Bin, a binning approach that leverages the benefits of convex hull distance for binning contigs represented by high dimensional feature vectors. We demonstrate, using experimental evidence on simulated and real datasets, that representing contigs with high dimensional feature vectors can preserve additional information and improve binning results. We further demonstrate that the convex hull distance based binning approach can be used effectively to bin such high dimensional data. To the best of our knowledge, this is the first time that composition information from oligonucleotides of multiple sizes has been used to represent the composition of contigs, and the first time that a convex hull distance based binning algorithm has been used to bin metagenomic contigs. The source code of CH-Bin is available at https://github.com/kdsuneraavinash/CH-Bin.
Pages: 9
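
The abstract names two ingredients: composition features built from oligonucleotides of multiple sizes, and a convex hull distance used in place of Euclidean or Manhattan distance. The sketch below illustrates both ideas in plain Python. It is not taken from the CH-Bin source code: the function names `multi_k_composition`, `convex_hull_distance`, and `nearest_bin` are hypothetical, the default k values are arbitrary, coverage features and reverse-complement merging are left out, and the Frank-Wolfe solver is simply one convenient way to compute the distance from a point to a convex hull, assumed here for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt


def multi_k_composition(contig, ks=(3, 4)):
    """Concatenate normalized k-mer frequency vectors for every k in ks.

    Assumption: plain overlapping k-mers over the alphabet ACGT; windows
    containing other symbols (for example N) are ignored.
    """
    contig = contig.upper()
    features = []
    for k in ks:
        kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        counts = Counter(contig[i:i + k] for i in range(len(contig) - k + 1))
        total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
        features.extend(counts[m] / total for m in kmers)
    return features


def convex_hull_distance(x, points, max_iter=1000, tol=1e-9):
    """Euclidean distance from x to the convex hull of points.

    Minimizes 0.5 * ||y - x||^2 over the hull with the Frank-Wolfe method,
    so the hull itself is never constructed explicitly.
    """
    y = list(points[0])  # any member of the set lies in the hull
    for _ in range(max_iter):
        grad = [yi - xi for yi, xi in zip(y, x)]
        # Linear minimization step: the point with the smallest dot product.
        s = min(points, key=lambda p: sum(g * pi for g, pi in zip(grad, p)))
        d = [si - yi for si, yi in zip(s, y)]
        gap = -sum(g * di for g, di in zip(grad, d))  # Frank-Wolfe gap
        dd = sum(di * di for di in d)
        if gap <= tol or dd == 0.0:
            break
        step = min(1.0, gap / dd)  # exact line search, clipped to [0, 1]
        y = [yi + step * di for yi, di in zip(y, d)]
    return sqrt(sum((yi - xi) ** 2 for yi, xi in zip(y, x)))


def nearest_bin(x, bins):
    """Index of the bin (a list of points) whose convex hull is closest to x."""
    return min(range(len(bins)), key=lambda i: convex_hull_distance(x, bins[i]))
```

With the default `ks=(3, 4)`, each contig is mapped to a vector of 64 + 256 = 320 composition features, and `nearest_bin` assigns a new vector to whichever existing bin has the closest hull. Frank-Wolfe converges slowly in general, so a quadratic programming solver would be a more natural choice at realistic scale.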