Association classification algorithm based on structure sequence in protein secondary structure prediction

Cited by: 15
Authors
Zhou, Zhun [1]
Yang, Bingru [2]
Hou, Wei [2]
Affiliations
[1] Tsinghua Univ, Dept Environm Sci & Engn, Beijing 100084, Peoples R China
[2] Univ Sci & Technol Beijing, Dept Informat Engn, Beijing 100083, Peoples R China
Keywords
Association classification; Protein secondary structure prediction; KDD* (Knowledge Discovery*); Compound pyramid model; Bioinformatics
DOI
10.1016/j.eswa.2010.02.081
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Objective: To propose a novel association classification algorithm, SAC (structural association classification), and to develop a compound pyramid model for accurate and precise protein secondary structure prediction.
Method: Following the sliding-window approach, the protein sequence was processed as windows of length 13, with the target amino acid at the center of each window and the remaining positions treated as secondary amino acid structures. At the head and tail of the sequence, the mirror method was used to fill the empty positions with the structure at the opposite position relative to the window center. In the mining process, the KDD* model focuses not only on rules with high support and high confidence, but also on rules with high confidence and low support, which are called 'knowledge in shortage'. Towards the end of the mining process, three sets H, E and C were created, consisting of the rules whose consequents are alpha-helix, beta-sheet and coil, respectively, to meet the basic requirements of protein secondary structure prediction. The knowledge base of protein secondary structure was then established from these three newly acquired rule sets. Building on the CMAR (Classification based on Multiple Association Rules) algorithm, a novel multi-classifier was developed that determines the most likely secondary structure for a given window by using the adjacent information in the amino acid sequence window and by screening the three rule sets.
Result: A protein knowledge base of 8049 rules was obtained, with 2642, 1895 and 3512 rules in sets H, E and C, respectively. Experiments show that, theoretically, the accuracy exceeded 85% at confidence thresholds of 70% and 90%. In four experiments using the multi-classifier SAC, accuracy and recall reached 83.06% on RS126 (Chen & Chaudhari, 2007; Guo et al., 2004; Hu et al., 2004; Liu et al., 2004) and 80.49% on CB513 (Guo et al., 2004; Liu et al., 2004; Wang & Liu, 2004), both measured by the Q3 criterion.
Conclusion: The structural association classification algorithm with pyramid classification developed in the present study achieved high accuracy in protein secondary structure prediction. The results suggest that it is a reliable and accurate alternative among contemporary protein structure prediction methods. (C) 2010 Elsevier Ltd. All rights reserved.
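The window construction and the Q3 measure mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the "mirror method" means replacing a window position that falls outside the sequence with the symbol at the position reflected across the window center, and that Q3 is the percentage of residues whose three-state label (H, E or C) is predicted correctly. The function names `mirrored_windows` and `q3` and the `pad` fallback symbol are hypothetical.

```python
def mirrored_windows(seq, width=13, pad="-"):
    """Return one window of `width` symbols per position of `seq`.

    Assumption: a window position outside the sequence is filled with the
    symbol at the mirrored position relative to the window center. If the
    mirrored position is also outside the sequence (very short input),
    the hypothetical `pad` symbol is used instead.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so that the window has a center")
    half = width // 2
    n = len(seq)
    windows = []
    for center in range(n):
        symbols = []
        for offset in range(-half, half + 1):
            j = center + offset
            if not 0 <= j < n:
                j = center - offset  # reflect across the window center
            symbols.append(seq[j] if 0 <= j < n else pad)
        windows.append("".join(symbols))
    return windows


def q3(predicted, observed):
    """Percentage of positions whose three-state label matches."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("label strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    example = "ABCDEFGHIJKLMNOP"  # placeholder symbols, not real protein data
    wins = mirrored_windows(example)
    print(wins[0])   # window centered on the first symbol
    print(wins[-1])  # window centered on the last symbol
    print(q3("HHEC", "HHCC"))
```

With this reading, every position of the sequence yields a full-length window, so the first and last residues can be classified by the same rule sets as interior residues.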
Pages: 6381-6389
Number of pages: 9
Related papers
14 in total
[1] [Anonymous], J COMPUTER RES DEV
[2] Cascaded bidirectional recurrent neural networks for protein secondary structure prediction [J]. Chen, Jinmiao; Chaudhari, Narendra S. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2007, 4 (04): 572-582
[3] Bioinformatics - An introduction for computer scientists [J]. Cohen, J. ACM COMPUTING SURVEYS, 2004, 36 (02): 122-158
[4] A novel method for protein secondary structure prediction using dual-layer SVM and profiles [J]. Guo, J; Chen, H; Sun, ZR; Lin, YL. PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2004, 54 (04): 738-743
[5] Bioinformatics and data mining in proteomics [J]. Haoudi, Abdelali; Bensmail, Halima. EXPERT REVIEW OF PROTEOMICS, 2006, 3 (03): 333-343
[6] Improved protein secondary structure prediction using support vector machine with a new encoding scheme and an advanced tertiary classifier [J]. Hu, HJ; Pan, Y; Harrison, R; Tai, PC. IEEE TRANSACTIONS ON NANOBIOSCIENCE, 2004, 3 (04): 265-271
[7] HYPROSP II - A knowledge-based hybrid method for protein secondary structure prediction based on local prediction confidence [J]. Lin, HN; Chang, JM; Wu, KP; Sung, TY; Hsu, WL. BIOINFORMATICS, 2005, 21 (15): 3227-3233
[8] LIU Y, 2004, CONTEXT SENSITIVE VO
[9] LONGFEI Y, 1999, PROTEIN MOL STRUCTUR
[10] From genome to function [J]. Thornton, JM. SCIENCE, 2001, 292 (5524): 2095-+