Improved K-means clustering algorithm for exploring local protein sequence motifs representing common structural property

Times cited: 57
Authors
Zhong, W
Altun, G
Harrison, R
Tai, PC
Pan, Y [1]
Affiliations
[1] Georgia State Univ, Dept Comp Sci, Atlanta, GA 30303 USA
[2] Georgia State Univ, Dept Biol, Atlanta, GA 30303 USA
Keywords
K-means clustering algorithm; protein structure; sequence motif
DOI
10.1109/TNB.2005.853667
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Information about local protein sequence motifs is very important to the analysis of biologically significant conserved regions of protein sequences. These conserved regions can potentially determine the diverse conformation and activities of proteins. In this work, recurring sequence motifs of proteins are explored with an improved K-means clustering algorithm on a new dataset. The structural similarity of these recurring sequence clusters to produce sequence motifs is studied in order to evaluate the relationship between sequence motifs and their structures. To the best of our knowledge, the dataset used by our research is the most updated dataset among similar studies for sequence motifs. A new greedy initialization method for the K-means algorithm is proposed to improve traditional K-means clustering techniques. The new initialization method tries to choose suitable initial points, which are well separated and have the potential to form high-quality clusters. Our experiments indicate that the improved K-means algorithm satisfactorily increases the percentage of sequence segments belonging to clusters with high structural similarity. Careful comparison of sequence motifs obtained by the improved and traditional algorithms also suggests that the improved K-means clustering algorithm may discover some relatively weak and subtle sequence motifs, which are undetectable by the traditional K-means algorithms. Many biochemical tests reported in the literature show that these sequence motifs are biologically meaningful. Experimental results also indicate that the improved K-means algorithm generates more detailed sequence motifs representing common structures than previous research. Furthermore, these motifs are universally conserved sequence patterns across protein families, overcoming some weak points of other popular sequence motifs. 
The satisfactory result of the experiment suggests that this new K-means algorithm may be applied to other areas of bioinformatics research in order to explore the underlying relationships between data samples more effectively.
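The abstract describes the greedy initialization only at a high level: pick initial points that are well separated from one another. The following is a minimal sketch of that general idea, not the authors' procedure. The Euclidean metric, the `min_dist` threshold, and the function names `greedy_init` and `kmeans` are all assumptions made for illustration, and the sketch omits the abstract's second criterion (the potential of a seed to form a high-quality cluster).

```python
import math
import random


def distance(a, b):
    # Euclidean distance; the metric is an assumption of this sketch.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_init(points, k, min_dist, seed=0):
    """Greedily pick up to k seeds that are at least min_dist apart.

    Hypothetical helper: candidates are visited in random order and a
    candidate is kept only if it is far enough from every seed chosen so far.
    May return fewer than k seeds if min_dist is too strict.
    """
    rng = random.Random(seed)
    candidates = list(points)
    rng.shuffle(candidates)
    centers = []
    for p in candidates:
        if all(distance(p, c) >= min_dist for c in centers):
            centers.append(p)
            if len(centers) == k:
                break
    return centers


def kmeans(points, centers, iterations=20):
    """Plain Lloyd-style K-means starting from the given centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its members;
        # an empty cluster keeps its previous center.
        centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else c
            for members, c in zip(clusters, centers)
        ]
    return centers, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    seeds = greedy_init(data, k=2, min_dist=5.0)
    final_centers, final_clusters = kmeans(data, seeds)
    print(final_centers)
```

Because the separation check rejects any candidate close to an existing seed, two seeds cannot start inside the same tight group, which is the failure mode of purely random initialization that the abstract's method aims to avoid.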
Pages: 255-265
Page count: 11
Related papers
42 entries in total
  • [1] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
    Altschul, SF
    Madden, TL
    Schaffer, AA
    Zhang, JH
    Zhang, Z
    Miller, W
    Lipman, DJ
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (17) : 3389 - 3402
  • [2] PRINTS and PRINTS-S shed light on protein ancestry
    Attwood, TK
    Blythe, MJ
    Flower, DR
    Gaulton, A
    Mabey, JE
    Maudling, N
    McGregor, L
    Mitchell, AL
    Moulton, G
    Paine, K
    Scordis, P
    [J]. NUCLEIC ACIDS RESEARCH, 2002, 30 (01) : 239 - 241
  • [3] Helix capping
    Aurora, R
    Rose, GD
    [J]. PROTEIN SCIENCE, 1998, 7 (01) : 21 - 38
  • [4] Determination of stereochemistry stability coefficients of amino acid side-chains in an amphipathic α-helix
    Chen, Y
    Mant, CT
    Hodges, RS
    [J]. JOURNAL OF PEPTIDE RESEARCH, 2002, 59 (01) : 18 - 33
  • [5] DEGRADO WF, 1988, ADV PROTEIN CHEM, V39, P51
  • [6] Durbin R., 1998, Biological sequence analysis: Probabilistic models of proteins and nucleic acids
  • [7] ERG JM, 2002, BIOCHEMISTRY-US, P53
  • [8] AMPHIPATHIC ANALYSIS AND POSSIBLE FORMATION OF THE ION CHANNEL IN AN ACETYLCHOLINE-RECEPTOR
    FINERMOORE, J
    STROUD, RM
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA-BIOLOGICAL SCIENCES, 1984, 81 (01) : 155 - 159
  • [9] Knowledge-based protein secondary structure assignment
    Frishman, D
    Argos, P
    [J]. PROTEINS-STRUCTURE FUNCTION AND GENETICS, 1995, 23 (04) : 566 - 579
  • [10] GUPTA SK, P DAT WAR KNOWL DISC, P203