Semi-supervised learning of Hidden Markov Models for biological sequence analysis

Cited by: 19
Authors
Tamposis, Ioannis A. [1]
Tsirigos, Konstantinos D. [2]
Theodoropoulou, Margarita C. [1]
Kontou, Panagiota I. [1]
Bagos, Pantelis G. [1]
Affiliations
[1] Univ Thessaly, Dept Comp Sci & Biomed Informat, Lamia 35100, Greece
[2] Tech Univ Denmark, Dept Bio & Hlth Informat, Lyngby, Denmark
Keywords
TRANSMEMBRANE PROTEIN TOPOLOGY; LIPOPROTEIN SIGNAL PEPTIDES; GRAM-POSITIVE BACTERIA; MAXIMUM-LIKELIHOOD; PREDICTION; ALGORITHM; DATABASE
DOI
10.1093/bioinformatics/bty910
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Hidden Markov Models (HMMs) are probabilistic models widely used in computational sequence analysis. HMMs are essentially unsupervised models; in the most important applications, however, they are trained in a supervised manner. Training examples accompanied by labels corresponding to different classes are given as input, and the set of parameters that maximizes the joint probability of sequences and labels is estimated. A main problem with this approach is that, in the majority of cases, labels are hard to find, and thus the amount of training data is limited. On the other hand, plenty of unclassified (unlabeled) sequences are deposited in public databases and could potentially contribute to the training procedure. Training on labeled and unlabeled data together is called semi-supervised learning and could be very helpful in many applications.
Results: We propose a method for semi-supervised learning of HMMs that can incorporate labeled, unlabeled and partially labeled data in a straightforward manner. The algorithm is based on a variant of the Expectation-Maximization (EM) algorithm, in which the missing labels of the unlabeled or partially labeled data are treated as the missing data. We apply the algorithm to several biological problems, namely the prediction of transmembrane protein topology for alpha-helical and beta-barrel membrane proteins and the prediction of archaeal signal peptides. The results are very promising, since the algorithms presented here can significantly improve the prediction performance of even the top-scoring classifiers.
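One way to picture how labeled, unlabeled and partially labeled sequences can share a single likelihood is a label-constrained forward pass: each state carries a class label, and at every position with a known label only states of that class are allowed. The Python sketch below illustrates that idea only. It is not the authors' implementation, and the function name `constrained_forward`, the two-state toy model and all probabilities are hypothetical. In an EM scheme the same constraint would presumably be applied in the forward-backward pass that collects expected counts.

```python
from typing import Dict, List, Optional, Sequence


def constrained_forward(
    obs: Sequence[str],
    labels: Sequence[Optional[str]],
    states: List[str],
    state_label: Dict[str, str],
    start: Dict[str, float],
    trans: Dict[str, Dict[str, float]],
    emit: Dict[str, Dict[str, float]],
) -> float:
    """Probability of obs summed over all state paths that agree with the
    known labels. labels[t] is None where position t is unlabeled."""

    def allowed(state: str, t: int) -> bool:
        # A state is allowed if the position is unlabeled or the labels match.
        return labels[t] is None or state_label[state] == labels[t]

    # Initialisation: states that contradict labels[0] get probability zero.
    alpha = {
        s: (start[s] * emit[s][obs[0]] if allowed(s, 0) else 0.0)
        for s in states
    }
    # Recursion: the usual forward update, masked by the label constraint.
    for t in range(1, len(obs)):
        alpha = {
            s: (
                sum(alpha[r] * trans[r][s] for r in states) * emit[s][obs[t]]
                if allowed(s, t)
                else 0.0
            )
            for s in states
        }
    return sum(alpha.values())


# Toy model (hypothetical numbers): state "M" has label "m", state "L" has label "o".
model = dict(
    states=["M", "L"],
    state_label={"M": "m", "L": "o"},
    start={"M": 0.5, "L": 0.5},
    trans={"M": {"M": 0.8, "L": 0.2}, "L": {"M": 0.3, "L": 0.7}},
    emit={"M": {"A": 0.7, "K": 0.3}, "L": {"A": 0.2, "K": 0.8}},
)
obs = "AAK"

p_unlabeled = constrained_forward(obs, [None, None, None], **model)  # P(x)
p_first_m = constrained_forward(obs, ["m", None, None], **model)     # partially labeled
p_first_o = constrained_forward(obs, ["o", None, None], **model)     # partially labeled
p_labeled = constrained_forward(obs, ["m", "m", "o"], **model)       # P(x, y)
```

With no labels the function returns the ordinary sequence likelihood, and with every position labeled it returns the joint probability of sequence and labels, so a single routine covers all three kinds of training data. Each added label can only remove paths, so the value never increases as more labels are supplied.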
Pages: 2208-2215
Number of pages: 8