Uniclust databases of clustered and deeply annotated protein sequences and alignments

Cited by: 429
Authors
Mirdita, Milot [1]
von den Driesch, Lars [1,2]
Galiez, Clovis [1]
Martin, Maria J. [2]
Soeding, Johannes [1]
Steinegger, Martin [1,3,4]
Affiliations
[1] Max Planck Inst Biophys Chem, Quantitat & Computat Biol Grp, Gottingen, Germany
[2] EBI, EMBL, Wellcome Trust Genome Campus, Cambridge, England
[3] Tech Univ Munich, Dept Bioinformat & Computat Biol, Munich, Germany
[4] Seoul Natl Univ, Dept Chem, Seoul, South Korea
Funding
European Research Council
DOI
10.1093/nar/gkw1081
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
We present three clustered protein sequence databases, Uniclust90, Uniclust50 and Uniclust30, and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2 software for fast and sensitive protein sequence searching and clustering. Uniclust sequences are annotated with matches to Pfam, SCOP domains, and proteins in the PDB, using our HHblits homology detection tool. Owing to the high sensitivity of HHblits, Uniclust contains 17% more Pfam domain annotations than UniProt. Uniboost MSAs of three diversities are built by enriching the Uniclust30 MSAs with local sequence matches from MMseqs2 profile searches through Uniclust30. All databases can be downloaded from the Uniclust server at uniclust.mmseqs.com. Users can search clusters by keywords and explore their MSAs, taxonomic representation, and annotations. Uniclust is updated every two months with the new UniProt release.
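To make the idea of clustering at 90%, 50% and 30% pairwise sequence identity concrete, here is a minimal toy sketch. It is not the MMseqs2 pipeline used to build Uniclust: the function names `pairwise_identity` and `greedy_cluster` are hypothetical, the sequences are invented, and identity is computed naively over equal-length, pre-aligned strings rather than from real alignments.

```python
from typing import Dict, List


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical positions in two pre-aligned, equal-length sequences.

    Assumption: a toy measure; real tools derive identity from alignments.
    """
    if len(a) != len(b):
        raise ValueError("toy identity requires equal-length (pre-aligned) sequences")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_cluster(seqs: Dict[str, str], min_identity: float) -> Dict[str, List[str]]:
    """Greedy clustering: each sequence joins the first representative it matches
    at or above min_identity, otherwise it becomes a new representative."""
    clusters: Dict[str, List[str]] = {}
    for name, seq in seqs.items():
        for rep in clusters:
            if pairwise_identity(seqs[rep], seq) >= min_identity:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough.
            clusters[name] = [name]
    return clusters


# Invented toy sequences (not real proteins).
toy = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIAKQK",  # 9/10 positions identical to A
    "C": "MKTGYLAKQK",  # 7/10 positions identical to A
    "D": "GGSPWWLLNE",  # no positions identical to A
}

for threshold in (0.9, 0.5, 0.3):
    print(threshold, greedy_cluster(toy, threshold))
```

Lowering the threshold merges more sequences into each cluster, which is why a 30% identity clustering yields fewer and more diverse clusters than a 90% one.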
Pages: D170-D176
Page count: 7