OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups

Cited by: 647
Authors
Chen, Feng
Mackey, Aaron J.
Stoeckert, Christian J., Jr.
Roos, David S. [1]
Affiliations
[1] Univ Penn, Dept Chem, Philadelphia, PA 19104 USA
[2] Univ Penn, Dept Biol, Philadelphia, PA 19104 USA
[3] Univ Penn, Dept Genet, Ctr Bioinformat, Penn Genom Inst, Philadelphia, PA 19104 USA
DOI
10.1093/nar/gkj123
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
The OrthoMCL database (http://orthomcl.cbil.upenn.edu) houses ortholog group predictions for 55 species, including 16 bacterial and 4 archaeal genomes representing phylogenetically diverse lineages, and most currently available complete eukaryotic genomes: 24 unikonts (12 animals, 9 fungi, microsporidium, Dictyostelium, Entamoeba), 4 plants/algae and 7 apicomplexan parasites. OrthoMCL software was used to cluster proteins based on sequence similarity, using an all-against-all BLAST search of each species' proteome, followed by normalization of inter-species differences, and Markov clustering. A total of 511,797 proteins (81.6% of the total dataset) were clustered into 70,388 ortholog groups. The ortholog database may be queried based on protein or group accession numbers, keyword descriptions or BLAST similarity. Ortholog groups exhibiting specific phyletic patterns may also be identified, using either a graphical interface or a text-based Phyletic Pattern Expression grammar. Information for ortholog groups includes the phyletic profile, the list of member proteins and a multiple sequence alignment, a statistical summary and graphical view of similarities, and a graphical representation of domain architecture. OrthoMCL software, the entire FASTA dataset employed and clustering results are available for download. OrthoMCL-DB provides a centralized warehouse for orthology prediction among multiple species, and will be updated and expanded as additional genome sequence data become available.
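The clustering step named in the abstract can be illustrated with a minimal sketch of Markov clustering on a small similarity graph. This is not the OrthoMCL software: the function name `markov_cluster`, the protein identifiers, the edge weights, and the default inflation value are all assumptions chosen for illustration, and the BLAST search and inter-species normalization steps are omitted entirely.

```python
def _normalize_columns(m):
    """Rescale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_sum = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / col_sum if col_sum else 0.0
    return out


def _matmul(a, b):
    """Plain square-matrix product, adequate for a toy example."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(names, edges, inflation=2.0, iterations=30):
    """Cluster nodes by alternating expansion and inflation.

    names: list of node labels (hypothetical protein IDs here).
    edges: dict mapping (name_a, name_b) to a similarity weight; in a real
           pipeline these would come from normalized BLAST scores.
    Returns a sorted list of clusters, each a sorted list of names.
    """
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    m = [[0.0] * n for _ in range(n)]
    for (a, b), w in edges.items():
        m[index[a]][index[b]] = w
        m[index[b]][index[a]] = w  # similarity graph is undirected
    for i in range(n):
        m[i][i] = 1.0  # self-loops, a common MCL convention
    m = _normalize_columns(m)

    for _ in range(iterations):
        m = _matmul(m, m)  # expansion: flow spreads along paths
        m = _normalize_columns(
            [[x ** inflation for x in row] for row in m]
        )  # inflation: strong flow is boosted, weak flow decays

    # Rows that retain mass act as attractors; the columns they reach
    # form one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(names[j] for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Hypothetical proteins from five species forming two unlinked families.
    proteins = ["hsa|P1", "mmu|P1", "dme|P1", "sce|P2", "ath|P2"]
    similarities = {
        ("hsa|P1", "mmu|P1"): 1.0,
        ("hsa|P1", "dme|P1"): 1.0,
        ("mmu|P1", "dme|P1"): 1.0,
        ("sce|P2", "ath|P2"): 1.0,
    }
    for group in markov_cluster(proteins, similarities):
        print(group)
```

Raising `inflation` generally yields finer-grained clusters and lowering it yields coarser ones, which is why production tools expose it as a tunable parameter.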
Pages: D363-D368
Page count: 6