Sequence/structure similarity and support vector machine for protein secondary structure prediction

Cited by: 0
Authors
Lin, JH
Tsai, CL
Lin, MR
Affiliations
Source
8TH WORLD MULTI-CONFERENCE ON SYSTEMICS, CYBERNETICS AND INFORMATICS, VOL XIII, PROCEEDINGS: INDUSTRIAL SYSTEMS | 2004
Keywords
support vector machine; protein sequences similarity; protein secondary structure prediction;
DOI
Not available
Chinese Library Classification
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
The majority of human coding regions have been sequenced, and several genome sequencing projects have been completed. As the volume of sequencing data grows, an efficient approach to protein analysis becomes more important. Protein function and structure are the foundations of drug design and protein-based products. However, it is difficult to predict protein function and three-dimensional structure directly from the amino acid sequence, so analyzing protein secondary structure is indispensable. Previous work has focused on classifying each residue into one of three secondary structure states: helix, strand, and coil. Secondary structure prediction is therefore commonly treated as a classification problem. Many studies that apply machine learning methods to this problem ignore the local sequence/structure properties of proteins. This affects prediction accuracy, because a large number of proteins are homologous even though their sequences are only remotely related. In this paper, we propose to use sequence similarity and Support Vector Machines (SVMs) to predict protein secondary structure. First, we adopt RS126 and CB513 as the experimental datasets. In this step, we encode the amino acid sequences and transform sequence segments into vectors for training. Second, we construct SVM classifiers that assign each residue of each sequence to one of the three secondary structure classes (i.e., H, E, or C). SVMs have been successfully applied to pattern recognition problems. They are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. They are well suited to computation on large-scale protein sequence data. Our approach achieves better accuracy than traditional machine learning methods for protein secondary structure prediction.
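The encoding step mentioned in the abstract (turning sequence segments into vectors for training) can be illustrated with a minimal sketch. The abstract does not specify the encoding, so everything here is an assumption made for illustration: the one-hot scheme, the window length of 13, the extra padding position, and the names `encode_windows`, `AMINO_ACIDS`, and `SLOT` are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: sliding-window one-hot encoding of a protein sequence.
# The scheme and the window length are assumptions, not the paper's stated method.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
SLOT = len(AMINO_ACIDS) + 1  # one extra position marks padding past the chain ends


def encode_windows(sequence, window=13):
    """Return one feature vector per residue, built from a window centered on it."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    vectors = []
    for center in range(len(sequence)):
        vec = [0] * (window * SLOT)
        for offset in range(-half, half + 1):
            pos = center + offset
            slot = offset + half
            if 0 <= pos < len(sequence):
                # Unknown residue symbols (e.g. X) share the padding position here.
                idx = AA_INDEX.get(sequence[pos], SLOT - 1)
            else:
                idx = SLOT - 1
            vec[slot * SLOT + idx] = 1
        vectors.append(vec)
    return vectors


if __name__ == "__main__":
    features = encode_windows("MKTAYIAKQR", window=5)
    print(len(features), len(features[0]))
```

Each vector would then be paired with the H, E, or C label of its central residue and passed to an SVM implementation of the reader's choice; how the three classes are combined (for example one-versus-rest binary classifiers) is likewise not stated in the abstract.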
Pages: 71-76
Page count: 6