Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering

Cited by: 103
Authors
Kelley, David R. [1 ,2 ,3 ]
Liu, Bo [1 ]
Delcher, Arthur L. [1 ]
Pop, Mihai [1 ]
Salzberg, Steven L. [4 ]
Affiliations
[1] Univ Maryland, Inst Adv Comp Studies, Ctr Bioinformat & Computat Biol, Dept Comp Sci, College Pk, MD 20742 USA
[2] 7 Cambridge Ctr, Broad Inst, Cambridge, MA 02142 USA
[3] Harvard Univ, Dept Stem Cell & Regenerat Biol, Cambridge, MA 02138 USA
[4] Johns Hopkins Univ, Sch Med, McKusick Nathans Inst Genet Med, Baltimore, MD USA
Keywords
DATA SETS; DNA; MICROBIOME; IDENTIFICATION; GENERATION; PATTERNS; SITE;
DOI
10.1093/nar/gkr1067
Chinese Library Classification (CLC) codes
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomics gene prediction system Glimmer-MG that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences to model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested.
Pages: 12