A Bayesian taxonomic classification method for 16S rRNA gene sequences with improved species-level accuracy

Cited by: 146
Authors
Gao, Xiang [1]
Lin, Huaiying [1,2]
Revanna, Kashi [1,2]
Dong, Qunfeng [1,2,3,4]
Affiliations
[1] Loyola Univ, Dept Publ Hlth Sci, Chicago Hlth Sci Div, Maywood, IL 60153 USA
[2] Loyola Univ, Ctr Biomed Informat, Chicago Hlth Sci Div, Maywood, IL 60153 USA
[3] Loyola Univ, Bioinformat Program, Chicago Lake Shore Campus, Chicago, IL 60660 USA
[4] Loyola Univ, Dept Comp Sci, Chicago Water Tower Campus, Chicago, IL 60611 USA
Source
BMC BIOINFORMATICS | 2017 / Vol. 18
Keywords
16S rRNA gene; Taxonomic classification; HIGH-THROUGHPUT; DATABASE; TOOLS; BLAST;
DOI
10.1186/s12859-017-1670-4
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Species-level classification for 16S rRNA gene sequences remains a serious challenge for microbiome researchers, because existing taxonomic classification tools for 16S rRNA gene sequences either do not provide species-level classification, or their classification results are unreliable. The unreliable results are due to the limitations in the existing methods which either lack solid probabilistic-based criteria to evaluate the confidence of their taxonomic assignments, or use nucleotide k-mer frequency as the proxy for sequence similarity measurement.
Results: We have developed a method that shows significantly improved species-level classification results over existing methods. Our method calculates true sequence similarity between query sequences and database hits using pairwise sequence alignment. Taxonomic classifications are assigned from the species to the phylum levels based on the lowest common ancestors of multiple database hits for each query sequence, and further classification reliabilities are evaluated by bootstrap confidence scores. The novelty of our method is that the contribution of each database hit to the taxonomic assignment of the query sequence is weighted by a Bayesian posterior probability based upon the degree of sequence similarity of the database hit to the query sequence. Our method does not need any training datasets specific for different taxonomic groups. Instead only a reference database is required for aligning to the query sequences, making our method easily applicable for different regions of the 16S rRNA gene or other phylogenetic marker genes.
Conclusions: Reliable species-level classification for 16S rRNA or other phylogenetic marker genes is critical for microbiome research. Our software shows significantly higher classification accuracy than the existing tools and we provide probabilistic-based confidence scores to evaluate the reliability of our taxonomic classification assignments based on multiple database matches to query sequences. Despite its higher computational costs, our method is still suitable for analyzing large-scale microbiome datasets for practical purposes. Furthermore, our method can be applied for taxonomic classification of any phylogenetic marker gene sequences. Our software, called BLCA, is freely available at https://github.com/qunfengdong/BLCA.
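The weighting idea described in the abstract can be illustrated with a small sketch. This is not BLCA's code and does not reproduce its posterior or bootstrap calculations: the function names (`rank_support`, `assign`), the `min_support` threshold, and the use of a normalized similarity score as a stand-in for the Bayesian posterior weight are all assumptions made for illustration. Each database hit is a pair of a similarity score and a lineage listed from phylum to species, and the assignment walks from species toward phylum until the weighted agreement among hits is high enough.

```python
from collections import defaultdict

RANKS = ("phylum", "class", "order", "family", "genus", "species")


def rank_support(hits):
    """Return {rank: (taxon, support)} for a list of (similarity, lineage) hits.

    Each lineage is a tuple of names ordered like RANKS. Support is the share
    of total hit weight whose lineage agrees down to that rank.
    """
    total = sum(sim for sim, _ in hits)
    if total <= 0:
        raise ValueError("hits must carry positive similarity scores")
    support = {}
    for depth, rank in enumerate(RANKS, start=1):
        votes = defaultdict(float)
        for sim, lineage in hits:
            # Assumption: normalized similarity stands in for the posterior weight.
            votes[lineage[:depth]] += sim / total
        best = max(votes, key=votes.get)
        support[rank] = (best[-1], votes[best])
    return support


def assign(hits, min_support=0.8):
    """Return (rank, taxon, support) for the lowest rank that clears min_support."""
    support = rank_support(hits)
    for rank in reversed(RANKS):  # species first, phylum last
        taxon, score = support[rank]
        if score >= min_support:
            return rank, taxon, score
    return None  # no rank reached the threshold


if __name__ == "__main__":
    base = ("Firmicutes", "Bacilli", "Lactobacillales",
            "Lactobacillaceae", "Lactobacillus")
    hits = [
        (0.99, base + ("Lactobacillus crispatus",)),
        (0.98, base + ("Lactobacillus crispatus",)),
        (0.97, base + ("Lactobacillus gasseri",)),
    ]
    print(assign(hits))
```

In this toy input the three hits split between two species, so the species-level support (about 0.67) falls below the hypothetical 0.8 threshold and the assignment falls back to the genus they share. Voting on lineage prefixes rather than on each rank independently keeps the chosen taxon at every rank consistent with its parent.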
Pages: 10