Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences

被引:85
作者
Mizianty, Marcin J. [1 ]
Kurgan, Lukasz [1 ]
机构
[1] Univ Alberta, Dept Elect & Comp Engn, Edmonton, AB, Canada
来源
BMC BIOINFORMATICS | 2009年 / 10卷
关键词
AMINO-ACID-COMPOSITION; SUPPORT VECTOR MACHINES; SECONDARY STRUCTURE-CONTENT; TERTIARY STRUCTURE; STATISTICAL-METHOD; FOLDING RATES; DOMAIN; DATABASE; EVOLUTIONARY; CLASSIFIER;
D O I
10.1186/1471-2105-10-414
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Background: Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences. Results: The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozens of modern competing predictors show that MODAS provides the best overall accuracy that ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reduction when compared against the best performing competing method on two largest datasets. The proposed predictor provides accurate predictions at 58% accuracy for membrane proteins class, which is not considered by majority of existing methods, in spite that this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes. Conclusions: The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.
引用
收藏
页数:24
相关论文
共 121 条
  • [1] AHA DW, 1991, MACH LEARN, V6, P37, DOI 10.1007/BF00153759
  • [2] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
    Altschul, SF
    Madden, TL
    Schaffer, AA
    Zhang, JH
    Zhang, Z
    Miller, W
    Lipman, DJ
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (17) : 3389 - 3402
  • [3] Study and prediction of secondary structure for membrane proteins
    Amirova, Svetlana R.
    Milchevsky, Juri V.
    Filatov, Ivan V.
    Esipova, Natalia G.
    Tumanyan, Vladimir G.
    [J]. JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS, 2007, 24 (04) : 421 - 427
  • [4] Predicting protein structural class by SVM with class-wise optimized features and decision probabilities
    Anand, Ashish
    Pugalenthi, Ganesan
    Suganthan, P. N.
    [J]. JOURNAL OF THEORETICAL BIOLOGY, 2008, 253 (02) : 375 - 380
  • [5] SCOP database in 2004: refinements integrate structure and sequence family data
    Andreeva, A
    Howorth, D
    Brenner, SE
    Hubbard, TJP
    Chothia, C
    Murzin, AG
    [J]. NUCLEIC ACIDS RESEARCH, 2004, 32 : D226 - D229
  • [6] [Anonymous], DATA MINING PRACTICA
  • [7] [Anonymous], 1997, P 14 INT C MACH LEAR
  • [8] The InterPro database, an integrated documentation resource for protein families, domains and functional sites
    Apweiler, R
    Attwood, TK
    Bairoch, A
    Bateman, A
    Birney, E
    Biswas, M
    Bucher, P
    Cerutti, T
    Corpet, F
    Croning, MDR
    Durbin, R
    Falquet, L
    Fleischmann, W
    Gouzy, J
    Hermjakob, H
    Hulo, N
    Jonassen, I
    Kahn, D
    Kanapin, A
    Karavidopoulou, Y
    Lopez, R
    Marx, B
    Mulder, NJ
    Oinn, TM
    Pagni, M
    Servant, F
    Sigrist, CJA
    Zdobnov, EM
    [J]. NUCLEIC ACIDS RESEARCH, 2001, 29 (01) : 37 - 40
  • [9] Assessing the accuracy of prediction algorithms for classification: an overview
    Baldi, P
    Brunak, S
    Chauvin, Y
    Andersen, CAF
    Nielsen, H
    [J]. BIOINFORMATICS, 2000, 16 (05) : 412 - 424
  • [10] The Protein Data Bank
    Berman, HM
    Westbrook, J
    Feng, Z
    Gilliland, G
    Bhat, TN
    Weissig, H
    Shindyalov, IN
    Bourne, PE
    [J]. NUCLEIC ACIDS RESEARCH, 2000, 28 (01) : 235 - 242