Using the RDP Classifier to Predict Taxonomic Novelty and Reduce the Search Space for Finding Novel Organisms

Cited by: 165
Authors
Lan, Yemin [1 ]
Wang, Qiong [2 ]
Cole, James R. [2 ]
Rosen, Gail L. [3 ]
Affiliations
[1] Drexel Univ, Sch Biomed Engn Sci & Hlth Syst, Philadelphia, PA 19104 USA
[2] Michigan State Univ, Ribosomal Database Project, E Lansing, MI 48824 USA
[3] Drexel Univ, Dept Elect & Comp Engn, Philadelphia, PA 19104 USA
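Funding
U.S. National Science Foundation;
Keywords
DIVERSITY; SEQUENCES;
DOI
10.1371/journal.pone.0032491
Chinese Library Classification number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract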
Background: Currently, the naive Bayesian classifier provided by the Ribosomal Database Project (RDP) is one of the most widely used tools to classify 16S rRNA sequences, mainly collected from environmental samples. We show that RDP has 97+% assignment accuracy and is fast for 250 bp and longer reads when the read originates from a taxon known to the database. Because most environmental samples will contain organisms from taxa whose 16S rRNA genes have not been previously sequenced, we aim to benchmark how well the RDP classifier and other competing methods can discriminate these novel taxa from known taxa.
Principal Findings: Because each fragment is assigned a score (containing likelihood or confidence information such as the bootstrap score in the RDP classifier), we "train" a threshold to discriminate between novel and known organisms and observe its performance on a test set. The threshold that we determine tends to be conservative (low sensitivity but high specificity) for naive Bayesian methods. Nonetheless, our method performs better with the RDP classifier than the other methods tested, measured by the F-measure and the area under the curve of the receiver operating characteristic on the test set. By constraining the database to well-represented genera, sensitivity improves 3-15%. Finally, we show that the detector is a good predictor of novel abundant taxa (especially for finer levels of taxonomy where novelty is more likely to be present).
Conclusions: We conclude that selecting a read-length-appropriate RDP bootstrap score can significantly reduce the search space for identifying novel genera and higher levels in taxonomy. In addition, having a well-represented database significantly improves performance, while having genera that are "highly" similar does not make a significant improvement. On a real dataset from an Amazon Terra Preta soil sample, we show that the detector can predict (or correlates to) whether novel sequences will be assigned to new taxa when the RDP database "doubles" in the future.
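The thresholding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the threshold is trained, so the sketch assumes that a read is flagged as novel when its score falls below the threshold, that scores lie between 0 and 1, and that the threshold is chosen to maximize the F-measure on the training set. The function names `f_measure` and `train_threshold` and all score values are hypothetical.

```python
from typing import List


def f_measure(scores: List[float], is_novel: List[bool], threshold: float) -> float:
    """F-measure of the rule 'score < threshold means novel' (assumed rule)."""
    tp = fp = fn = 0
    for score, novel in zip(scores, is_novel):
        predicted_novel = score < threshold
        if predicted_novel and novel:
            tp += 1
        elif predicted_novel and not novel:
            fp += 1
        elif not predicted_novel and novel:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the sensitivity for the novel class
    return 2 * precision * recall / (precision + recall)


def train_threshold(scores: List[float], is_novel: List[bool]) -> float:
    """Pick the candidate threshold with the highest training F-measure.

    Candidates are the distinct observed scores plus one value above the
    maximum, so that 'everything is novel' is also considered.
    """
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    return max(candidates, key=lambda t: f_measure(scores, is_novel, t))


# Made-up training data: classifier scores and whether each read is truly novel.
train_scores = [0.95, 0.90, 0.85, 0.40, 0.30, 0.60, 0.99, 0.20]
train_novel = [False, False, False, True, True, True, False, True]

threshold = train_threshold(train_scores, train_novel)

# Made-up held-out data, used only to evaluate the trained threshold.
test_scores = [0.50, 0.97, 0.70, 0.88]
test_novel = [True, False, True, False]
test_f = f_measure(test_scores, test_novel, threshold)
```

Evaluating the trained threshold on a separate test set mirrors the train/test split described in the abstract. A different selection criterion, such as a fixed specificity target, would change which threshold is chosen.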
Pages: 15