SEQOPTICS: a protein sequence clustering system

Cited by: 14
Authors
Chen, Yonghui [1 ]
Reilly, Kevin D.
Sprague, Alan P.
Guan, Zhijie
Affiliations
[1] Univ Alabama Birmingham, Dept Comp & Informat Sci, Birmingham, AL 35294 USA
[2] Univ Calif San Diego, San Diego Supercomp Ctr, La Jolla, CA 92093 USA
DOI
10.1186/1471-2105-7-S4-S10
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Protein sequence clustering is widely used in the analysis of protein structure and function. In most cases, single-linkage or graph-based clustering algorithms have been applied. OPTICS (Ordering Points To Identify the Clustering Structure) is an attractive alternative because it emphasizes visualization of results and supports interactive work, e.g., in choosing parameters. However, as far as we know, OPTICS has not previously been used for protein sequence clustering. Results: This paper demonstrates SEQOPTICS (SEQuence clustering with OPTICS), a system for clustering proteins. The system uses Smith-Waterman as its protein distance measure and OPTICS at its core to perform protein sequence clustering. SEQOPTICS is tested on four data sets from different data sources, and visualization of the sequence clustering structure is demonstrated as well. Conclusion: The system was evaluated by comparison with other existing methods. Analysis of the results shows that SEQOPTICS performs better on several evaluation criteria, including the Jaccard coefficient, precision, and recall. It is a promising protein sequence clustering method, with possible future improvements in parallel computing and in other protein distance measures.
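The abstract names the Jaccard coefficient, precision, and recall as evaluation criteria but does not define them. The sketch below shows one common pair-counting formulation for comparing a clustering against a reference grouping; it is an assumption that the paper uses this variant, and the function names and example labels are hypothetical.

```python
from itertools import combinations


def pair_counts(predicted, reference):
    """Classify every pair of sequences by co-membership in the two labelings.

    predicted, reference: dicts mapping a sequence id to a cluster label.
    Returns (both, predicted_only, reference_only) pair counts.
    """
    both = predicted_only = reference_only = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_ref = reference[a] == reference[b]
        if same_pred and same_ref:
            both += 1            # pair grouped together in both labelings
        elif same_pred:
            predicted_only += 1  # grouped by the clustering only
        elif same_ref:
            reference_only += 1  # grouped by the reference only
    return both, predicted_only, reference_only


def evaluate(predicted, reference):
    """Pair-counting precision, recall, and Jaccard coefficient (assumed definitions)."""
    tp, fp, fn = pair_counts(predicted, reference)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "jaccard": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Hypothetical example: four sequences, two reference families.
predicted = {"p1": 0, "p2": 0, "p3": 0, "p4": 1}
reference = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
scores = evaluate(predicted, reference)
```

Under this formulation the Jaccard coefficient penalizes both over-merging and over-splitting, while precision and recall report the two error types separately.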
Pages: 9