An assessment of genome annotation coverage across the bacterial tree of life

Cited by: 74
Authors
Lobb, Briallen [1]
Tremblay, Benjamin Jean-Marie [1]
Moreno-Hagelsieb, Gabriel [2]
Doxey, Andrew C. [1]
Affiliations
[1] Univ Waterloo, Dept Biol, 200 Univ Ave West, Waterloo, ON N2L 3G1, Canada
[2] Wilfrid Laurier Univ, Dept Biol, 75 Univ Ave West, Waterloo, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
genome annotation; functional annotation; gene function prediction; bacterial genomics; phylogenomics; tree of life; CONSERVED HYPOTHETICAL PROTEINS; DATABASE; GENES; SCALE; PSEUDOGENES; BUCHNERA; SEQUENCE; VIEW; CELL; SET;
DOI
10.1099/mgen.0.000341
Chinese Library Classification
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Although gene-finding in bacterial genomes is relatively straightforward, the automated assignment of gene function is still challenging, resulting in a vast quantity of hypothetical sequences of unknown function. But how prevalent are hypothetical sequences across bacteria, what proportion of genes in different bacterial genomes remain unannotated, and what factors affect annotation completeness? To address these questions, we surveyed over 27 000 bacterial genomes from the Genome Taxonomy Database, and measured genome annotation completeness as a function of annotation method, taxonomy, genome size, 'research bias' and publication date. Our analysis revealed that 52 and 79 % of the average bacterial proteome could be functionally annotated based on protein and domain-based homology searches, respectively. Annotation coverage using protein homology search varied significantly from as low as 14 % in some species to as high as 98 % in others. We found that taxonomy is a major factor influencing annotation completeness, with distinct trends observed across the microbial tree (e.g. the lowest level of completeness was found in the Patescibacteria lineage). Most lineages showed a significant association between genome size and annotation incompleteness, likely reflecting a greater degree of uncharacterized sequences in 'accessory' proteomes than in 'core' proteomes. Finally, research bias, as measured by publication volume, was also an important factor influencing genome annotation completeness, with early model organisms showing high completeness levels relative to other genomes in their own taxonomic lineages. Our work highlights the disparity in annotation coverage across the bacterial tree of life and emphasizes a need for more experimental characterization of accessory proteomes as well as understudied lineages.
Pages: 11