Uniclust databases of clustered and deeply annotated protein sequences and alignments

Cited by: 428
Authors
Mirdita, Milot [1]
von den Driesch, Lars [1,2]
Galiez, Clovis [1]
Martin, Maria J. [2]
Soeding, Johannes [1]
Steinegger, Martin [1,3,4]
Affiliations
[1] Max Planck Inst Biophys Chem, Quantitat & Computat Biol Grp, Gottingen, Germany
[2] EBI, EMBL, Wellcome Trust Genome Campus, Cambridge, England
[3] Tech Univ Munich, Dept Bioinformat & Computat Biol, Munich, Germany
[4] Seoul Natl Univ, Dept Chem, Seoul, South Korea
Funding
European Research Council
DOI
10.1093/nar/gkw1081
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
We present three clustered protein sequence databases, Uniclust90, Uniclust50 and Uniclust30, and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2 software for fast and sensitive protein sequence searching and clustering. Uniclust sequences are annotated with matches to Pfam, SCOP domains, and proteins in the PDB, using our HHblits homology detection tool. Owing to the high sensitivity of HHblits, Uniclust contains 17% more Pfam domain annotations than UniProt. Uniboost MSAs of three diversities are built by enriching the Uniclust30 MSAs with local sequence matches from MMseqs2 profile searches through Uniclust30. All databases can be downloaded from the Uniclust server at uniclust.mmseqs.com. Users can search clusters by keywords and explore their MSAs, taxonomic representation, and annotations. Uniclust is updated every two months with the new UniProt release.
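To make the idea of clustering "at the level of 90%, 50% and 30% pairwise sequence identity" concrete, the following is a minimal toy sketch of greedy clustering against an identity threshold. It is not the MMseqs2 pipeline described in the abstract: it assumes the sequences are already aligned to a common length (with `-` for gaps), and the function names `identity` and `greedy_cluster` and the example sequences in `toy` are hypothetical, introduced only for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def greedy_cluster(seqs: dict[str, str], min_identity: float) -> dict[str, list[str]]:
    """Greedy clustering: each sequence joins the first representative it matches
    at >= min_identity, otherwise it becomes a new representative."""
    clusters: dict[str, list[str]] = {}
    # Visit longer sequences first (ungapped length); ties keep input order.
    order = sorted(seqs, key=lambda n: len(seqs[n].replace("-", "")), reverse=True)
    for name in order:
        for rep in clusters:
            if identity(seqs[name], seqs[rep]) >= min_identity:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Hypothetical, hand-made sequences (not taken from UniProtKB).
toy = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQK",
    "c": "MKTAYLLKQK",
    "d": "GGGGGGGGGG",
}

for threshold in (0.9, 0.5, 0.3):
    print(threshold, greedy_cluster(toy, threshold))
```

Lowering the threshold merges more sequences into fewer, more diverse clusters, which is the relationship between the 90%, 50% and 30% levels. A production tool must also compute the alignments and avoid all-against-all comparisons, which this sketch ignores.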
Pages: D170-D176
Number of pages: 7