Prediction of Indel flanking regions in protein sequences using a variable-order Markov model

Cited by: 4
Authors
Al-Shatnawi, Mufleh [1]
Ahmad, M. Omair [1]
Swamy, M. N. S. [1]
Affiliations
[1] Concordia Univ, Dept Elect & Comp Engn, Montreal, PQ H3G 2W1, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
INSERTIONS/DELETIONS; SUBSTITUTION; ALIGNMENT; EVOLUTION; DATABASE; IMPACT;
DOI
10.1093/bioinformatics/btu556
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: Insertion/deletion (indel) and amino acid substitution are two common events that drive the evolution of, and variation in, protein sequences. Many human diseases and much of the functional divergence between homologous proteins are more closely related to indel mutations, even though indels occur less often than substitutions. Reliable identification of indels and their flanking regions is a major challenge in research on protein evolution, structure and function. Results: In this article, we propose a novel scheme, based on a variable-order Markov model, to predict indel flanking regions in a protein sequence for a given protein fold. The proposed indel flanking region (IndelFR) predictors are built on prediction by partial match (PPM) and the probabilistic suffix tree (PST), and are referred to as the PPM IndelFR and PST IndelFR predictors, respectively. The overall performance evaluation shows that the proposed predictors identify IndelFRs in protein sequences with high accuracy and a high F1 measure. The results also show that, when the only goal is to predict IndelFRs in protein sequences, the proposed predictors are preferable to HMMER 3.0, which they outperform substantially.
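To make the idea of a variable-order Markov model concrete, the following is a minimal Python sketch of a PPM-style predictor over amino acid sequences. It is an illustration only, not the authors' implementation: the class name `VariableOrderMarkov`, the maximum order of 2, the method-C style escape estimate and the omission of symbol exclusion are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


class VariableOrderMarkov:
    """Toy variable-order Markov model with PPM-style escapes (hypothetical helper)."""

    def __init__(self, max_order=2, alphabet=AMINO_ACIDS):
        self.max_order = max_order
        self.alphabet = alphabet
        # Maps a context string (length 0..max_order) to next-symbol counts.
        self.counts = defaultdict(Counter)

    def train(self, sequences):
        """Count each symbol after every context of length 0..max_order."""
        for seq in sequences:
            for i, symbol in enumerate(seq):
                for k in range(self.max_order + 1):
                    if i - k < 0:
                        break
                    self.counts[seq[i - k:i]][symbol] += 1

    def prob(self, symbol, context):
        """Estimate P(symbol | context), backing off to shorter contexts.

        Escape mass follows a method-C style estimate. Symbol exclusion is
        omitted for brevity, so the estimates sum to slightly less than 1.
        """
        context = context[-self.max_order:] if self.max_order else ""
        escape = 1.0
        for k in range(len(context), -1, -1):
            seen = self.counts.get(context[len(context) - k:])
            if not seen:
                continue  # context never observed: back off at no cost
            total, distinct = sum(seen.values()), len(seen)
            if symbol in seen:
                return escape * seen[symbol] / (total + distinct)
            escape *= distinct / (total + distinct)
        return escape / len(self.alphabet)  # final fallback: uniform

    def avg_log_likelihood(self, seq):
        """Average log2-probability per residue of seq under the model."""
        total = 0.0
        for i, symbol in enumerate(seq):
            context = seq[max(0, i - self.max_order):i]
            total += math.log2(self.prob(symbol, context))
        return total / len(seq)


model = VariableOrderMarkov(max_order=2)
model.train(["ACDACDACD"])
print(model.prob("D", "AC"))  # frequent continuation: high probability
print(model.prob("W", "AC"))  # unseen residue: small escape probability
```

One plausible way to use such a model as a predictor, again as an assumption rather than a description of the paper, is to train it on known flanking-region segments and score candidate segments by their average log-likelihood.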
Pages: 40-47
Number of pages: 8