Tolerating some redundancy significantly speeds up clustering of large protein databases

Cited by: 415
Authors
Li, WZ [1]
Jaroszewski, L [1]
Godzik, A [1]
Affiliations
[1] Burnham Inst, La Jolla, CA 92037 USA
DOI
10.1093/bioinformatics/18.1.77
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases such as the NCBI Non-Redundant database (NR) is very time-consuming even with the best currently available clustering algorithms, and is only practical at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in ~1 h and at 75% identity in ~1 day on a 1 GHz Linux PC (Li et al., Bioinformatics, 17, 282, 2001). However, even faster clustering is needed because protein databases are growing rapidly and many applications require lower attainable thresholds.
Results: Our previous algorithm (CD-HI) employed short-word filters to speed up the clustering. In this paper, we show that tolerating some redundancy makes for more efficient use of these short-word filters and increases the program's speed 100 times. Our new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in ~5 days. Although some redundancy is present after clustering, our new program's results differ from our previous program's by less than 0.4%.
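The idea of a short-word filter in front of a greedy clustering loop can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the word length k, the word-count bound (each mismatch is assumed to destroy at most k words) and the use of longest-common-subsequence length as a stand-in for a real alignment identity are all assumptions made for illustration. Sequences are assumed to be non-empty.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a_counts, b_counts):
    """Number of words two sequences have in common (multiset intersection)."""
    return sum((a_counts & b_counts).values())


def min_shared_kmers(length, k, identity):
    """Assumed bound: a sequence of this length has length - k + 1 words,
    and each allowed mismatch can destroy at most k of them."""
    mismatches = length - math.ceil(identity * length)
    return max(0, length - k + 1 - k * mismatches)


def lcs_identity(a, b):
    """Stand-in for alignment identity (an assumption of this sketch):
    longest-common-subsequence length divided by the shorter length."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / min(len(a), len(b))


def greedy_cluster(seqs, identity=0.9, k=2):
    """Process sequences longest first; each one joins the first
    representative it matches, or becomes a new representative."""
    reps = []       # (representative sequence, its word counts)
    clusters = {}   # representative -> list of member sequences
    for s in sorted(seqs, key=len, reverse=True):
        counts = kmer_counts(s, k)
        need = min_shared_kmers(len(s), k, identity)
        for rep, rep_counts in reps:
            # Cheap filter: skip the expensive comparison when too few
            # words are shared for the pair to reach the threshold.
            if shared_kmers(counts, rep_counts) < need:
                continue
            if lcs_identity(s, rep) >= identity:
                clusters[rep].append(s)
                break
        else:
            reps.append((s, counts))
            clusters[s] = [s]
    return clusters


if __name__ == "__main__":
    demo = ["ACDEFGHIKV", "WWWWWWWWWW", "ACDEFGHIKL"]
    for rep, members in greedy_cluster(demo, identity=0.8, k=2).items():
        print(rep, members)
```

In this sketch the filter is only a prefilter, so a pair that it rejects is never compared and simply stays in separate clusters. A longer word length makes the filter cheaper to satisfy or fail quickly but can reject more pairs that a full comparison would have accepted.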
Pages: 77-82
Number of pages: 6
Related papers
7 in total
  • [1] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
    Altschul, SF
    Madden, TL
    Schaffer, AA
    Zhang, JH
    Zhang, Z
    Miller, W
    Lipman, DJ
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (17) : 3389 - 3402
  • [2] The Pfam protein families database
    Bateman, A
    Birney, E
    Durbin, R
    Eddy, SR
    Howe, KL
    Sonnhammer, ELL
    [J]. NUCLEIC ACIDS RESEARCH, 2000, 28 (01) : 263 - 266
  • [3] The Protein Data Bank
    Berman, HM
    Westbrook, J
    Feng, Z
    Gilliland, G
    Bhat, TN
    Weissig, H
    Shindyalov, IN
    Bourne, PE
    [J]. NUCLEIC ACIDS RESEARCH, 2000, 28 (01) : 235 - 242
  • [4] HOBOHM U, 1992, PROTEIN SCI, V1, P409
  • [5] Removing near-neighbour redundancy from large protein sequence collections
    Holm, L
    Sander, C
    [J]. BIOINFORMATICS, 1998, 14 (05) : 423 - 429
  • [6] Clustering of highly homologous sequences to reduce the size of large protein databases
    Li, WZ
    Jaroszewski, L
    Godzik, A
    [J]. BIOINFORMATICS, 2001, 17 (03) : 282 - 283
  • [7] RSDB: representative protein sequence databases have high information content
    Park, J
    Holm, L
    Heger, A
    Chothia, C
    [J]. BIOINFORMATICS, 2000, 16 (05) : 458 - 464