Protein families and TRIBES in genome sequence space

Cited by: 101
Authors
Enright, AJ [1]
Kunin, V [1]
Ouzounis, CA [1]
Affiliation
[1] EMBL Cambridge Outstn, Computat Genom Grp, European Bioinformat Inst, Cambridge CB10 1SD, England
DOI
10.1093/nar/gkg495
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular biology]
Discipline classification codes
071010; 081704
Abstract
Accurate detection of protein families allows assignment of protein function and the analysis of functional diversity in complete genomes. Recently, we presented a novel algorithm called TribeMCL for the detection of protein families that is both accurate and efficient. This method allows family analysis to be carried out on a very large scale. Using TribeMCL, we have generated a resource called Tribes that contains protein family information, comprising annotations, protein sequence alignments and phylogenetic distributions describing 311 257 proteins from 83 completely sequenced genomes. The analysis of at least 60 934 detected protein families reveals that, with the essential families excluded, paralogy levels are similar between prokaryotes, irrespective of genome size. The number of essential families is estimated to be between 366 and 426. We also show that the currently known space of protein families is scale free and discuss the implications of this distribution. In addition, we show that smaller families are often formed by shorter proteins and discuss the reasons for this intriguing pattern. Finally, we analyse the functional diversity of protein families in entire genome sequences. The Tribes protein family resource is accessible at http://www.ebi.ac.uk/research/cgg/tribes.
Pages: 4632-4638
Page count: 7
Related papers
30 in total
  • [1] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
    Altschul, SF
    Madden, TL
    Schaffer, AA
    Zhang, JH
    Zhang, Z
    Miller, W
    Lipman, DJ
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (17) : 3389 - 3402
  • [2] Functional classes in the three domains of life
    Andrade, MA
    Ouzounis, C
    Sander, C
    Tamames, J
    Valencia, A
    [J]. JOURNAL OF MOLECULAR EVOLUTION, 1999, 49 (05) : 551 - 557
  • [3] Automated genome sequence analysis and annotation
    Andrade, MA
    Brown, NP
    Leroy, C
    Hoersch, S
    de Daruvar, A
    Reich, C
    Franchini, A
    Tamames, J
    Valencia, A
    Ouzounis, C
    Sander, C
    [J]. BIOINFORMATICS, 1999, 15 (05) : 391 - 412
  • [4] The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000
    Bairoch, A
    Apweiler, R
    [J]. NUCLEIC ACIDS RESEARCH, 2000, 28 (01) : 45 - 48
  • [5] Bateman A, 2004, NUCLEIC ACIDS RES, V32, pD138, DOI [10.1093/nar/gkp985, 10.1093/nar/gkh121, 10.1093/nar/gkr1065]
  • [6] Mining the draft human genome
    Birney, E
    Bateman, A
    Clamp, ME
    Hubbard, TJ
    [J]. NATURE, 2001, 409 (6822) : 827 - 828
  • [7] THE RELATION BETWEEN THE DIVERGENCE OF SEQUENCE AND STRUCTURE IN PROTEINS
    CHOTHIA, C
    LESK, AM
    [J]. EMBO JOURNAL, 1986, 5 (04) : 823 - 826
  • [8] Devos D, 2000, PROTEINS, V41, P98, DOI 10.1002/1097-0134(20001001)41:1<98::AID-PROT120>3.0.CO;2-S
  • [10] SIMILAR AMINO-ACID-SEQUENCES - CHANCE OR COMMON ANCESTRY
    DOOLITTLE, RF
    [J]. SCIENCE, 1981, 214 (4517) : 149 - 159