A De-Novo Genome Analysis Pipeline (DeNoGAP) for large-scale comparative prokaryotic genomics studies

Cited by: 6
Authors
Thakur, Shalabh [1]
Guttman, David S. [1, 2]
Affiliations
[1] Univ Toronto, Dept Cell & Syst Biol, Toronto, ON, Canada
[2] Univ Toronto, Ctr Anal Genome Evolut & Funct, Toronto, ON, Canada
Source
BMC BIOINFORMATICS | 2016 / Volume 17
Keywords
Comparative genomics; Prokaryotes; Gene prediction; Gene annotation; Ortholog identification; Functional annotation; Pan genome; Core genome; Flexible genome; FUNCTION PREDICTION; SEQUENCE-ANALYSIS; PROTEIN; GENE; ORTHOLOG; DATABASE; IDENTIFICATION; ALIGNMENT; QUEST; TREES;
DOI
10.1186/s12859-016-1142-2
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: Comparative analysis of whole genome sequence data from closely related prokaryotic species or strains is becoming an increasingly important and accessible approach for addressing both fundamental and applied biological questions. While a number of excellent tools have been developed for this task, most scale poorly when faced with hundreds of genome sequences, and many require extensive manual curation. Results: We have developed a de-novo genome analysis pipeline (DeNoGAP) for the automated, iterative and high-throughput analysis of data from comparative genomics projects involving hundreds of whole genome sequences. The pipeline is designed to perform reference-assisted and de novo gene prediction, homolog protein family assignment, ortholog prediction, functional annotation, and pan-genome analysis using a range of proven tools and databases. Most existing methods scale quadratically with the number of genomes because they rely on pairwise comparisons among predicted protein sequences, whereas DeNoGAP scales linearly because its homology assignment is based on iteratively refined hidden Markov models. This iterative clustering strategy enables DeNoGAP to handle a very large number of genomes using minimal computational resources. Moreover, the modular structure of the pipeline permits easy updates as new analysis programs become available. Conclusion: DeNoGAP integrates bioinformatics tools and databases for comparative analysis of a large number of genomes. The pipeline offers tools and algorithms for annotation and analysis of completed and draft genome sequences. The pipeline was developed using Perl, BioPerl and SQLite on Ubuntu Linux version 12.04 LTS. Currently, the software package is accompanied by a script for the automated installation of the necessary external programs on Ubuntu Linux; however, the pipeline should also be compatible with other Linux and Unix systems once the necessary external programs are installed.
Pages: 18