Protein Secondary Structure Prediction Using Machine Learning

Cited by: 0
Authors
Saha, Sriparna [1]
Ekbal, Asif [1]
Sharma, Sidharth [1]
Bandyopadhyay, Sanghamitra [2]
Maulik, Ujjwal [3]
Affiliations
[1] Indian Inst Technol Patna, Dept Comp Sci & Engn, Patna, Bihar, India
[2] Indian Stat Inst, Machine Intelligence Unit, Kolkata, India
[3] Jadavpur Univ, Dept Comp Sci & Engn, Kolkata, India
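Source
INTELLIGENT INFORMATICS | 2013 / Vol. 182
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract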
Protein structure prediction is an important component in understanding protein structures and functions. Accurate prediction of protein secondary structure helps in understanding protein folding. Many applications, such as drug discovery, require predicting the secondary structure of unknown proteins. In this paper we report our first attempt at secondary structure prediction and approach it as a sequence classification problem, where the task is equivalent to assigning a sequence of labels (i.e., helix, sheet, and coil) to the given protein sequence. We propose an ensemble technique based on two stochastic supervised machine learning algorithms, namely Maximum Entropy Markov Model (MEMM) and Conditional Random Field (CRF). We identify and implement a set of features that mostly deal with contextual information. The proposed approach is evaluated on a benchmark dataset, and its performance is encouraging enough to warrant further exploration. We obtain a highest predictive accuracy of 61.26% and a segment overlap score (SOV) of 52.30%.
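The abstract treats prediction as per-residue labeling driven by contextual features and reports per-residue accuracy. This record does not give the paper's actual feature set or evaluation code, so the following is only a minimal sketch under stated assumptions: a symmetric residue window as the contextual feature, and accuracy computed as the percentage of residues labeled correctly. The function names `window_features` and `q3_accuracy`, the window size `k`, and the `PAD` symbol are hypothetical.

```python
from typing import Dict


def window_features(seq: str, i: int, k: int = 2) -> Dict[str, str]:
    """Contextual features for residue i: the residues at offsets -k..+k.

    Assumption: positions outside the sequence are filled with "PAD".
    """
    feats = {}
    for off in range(-k, k + 1):
        j = i + off
        feats[f"res[{off:+d}]"] = seq[j] if 0 <= j < len(seq) else "PAD"
    return feats


def q3_accuracy(pred: str, true: str) -> float:
    """Percentage of residues whose predicted label (H/E/C) matches the true one."""
    if len(pred) != len(true) or not true:
        raise ValueError("label strings must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(pred, true))
    return 100.0 * correct / len(true)
```

Feature dictionaries of this shape could be fed to a sequence labeler such as a CRF, one dictionary per residue. The segment overlap score is a separate, segment-level measure and is not computed here.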
Pages: 57 / +
Page count: 2
Related papers
[1] Darroch, J.N.; Ratcliff, D. Generalized iterative scaling for log-linear models [J]. Annals of Mathematical Statistics, 1972, 43(5): 1470-&
[2] Lafferty, J.D., 2001, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, P282, DOI 10.5555/645530.655813
[3] Thornton, J.M. From genome to function [J]. Science, 2001, 292(5524): 2095-+
[4] Zemla, A., 1999, Proteins, V34, P220, DOI 10.1002/(SICI)1097-0134(19990201)34:2<220::AID-PROT7>3.0.CO;2-K