Splice site identification using probabilistic parameters and SVM classification

Cited by: 78
Authors
Baten, A. K. M. A. [1]
Chang, B. C. H. [1]
Halgamuge, S. K. [1]
Li, Jason [1]
Affiliation
[1] Univ Melbourne, DoMME, Dynam Syst & Control Res Grp, Melbourne, Vic 3010, Australia
DOI
10.1186/1471-2105-7-S5-S15
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: Recent advances and automation in DNA sequencing technology have created a vast amount of DNA sequence data. This growth in sequence data demands better and more efficient analysis methods. Identifying genes in this newly accumulated data is an important problem in bioinformatics, and it requires prediction of the complete gene structure. Accurate identification of splice sites in DNA sequences plays a central role in gene structure prediction in eukaryotes. Effective detection of splice sites requires knowledge of the characteristics, dependencies, and relationships of nucleotides in the region surrounding the splice site. Higher-order Markov models are generally regarded as a useful technique for modeling higher-order dependencies. However, implementing them requires estimating a large number of parameters, which is computationally expensive.
Results: The proposed method for splice site detection consists of two stages: a first-order Markov model (MMI) is used in the first stage, and a support vector machine (SVM) with a polynomial kernel is used in the second stage. The MMI serves as a pre-processing step for the SVM and takes DNA sequences as its input. It models the compositional features and dependencies of nucleotides around splice site regions in terms of probabilistic parameters. The probabilistic parameters are then fed into the SVM, which combines them nonlinearly to predict splice sites. When the proposed MMI-SVM model is compared with other existing standard splice site detection methods, it shows superior performance in all cases.
Conclusion: We proposed an effective pre-processing scheme for the SVM and applied it to the identification of splice sites. This is a simple yet effective splice site detection method, which shows better classification accuracy and computational speed than some other, more complex methods.
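The pre-processing stage described in the abstract can be illustrated with a minimal sketch. It estimates position-specific first-order Markov parameters from aligned training sequences and then encodes a sequence as a vector of those probabilities. This is not the authors' implementation: the function names, the pseudocount smoothing, the toy sequences, and the exact feature encoding are all assumptions made for illustration.

```python
ALPHABET = "ACGT"

def fit_first_order_markov(seqs, pseudocount=1.0):
    """Estimate position-specific first-order Markov parameters.

    seqs: aligned, equal-length DNA strings over ACGT (assumed input format).
    Returns (initial, transitions), where initial[a] approximates P(s_0 = a)
    and transitions[i-1][p][c] approximates P(s_i = c | s_{i-1} = p).
    """
    length = len(seqs[0])
    # Start every count at the pseudocount so unseen events keep nonzero mass.
    init_counts = {a: pseudocount for a in ALPHABET}
    trans_counts = [
        {p: {c: pseudocount for c in ALPHABET} for p in ALPHABET}
        for _ in range(length - 1)
    ]
    for s in seqs:
        init_counts[s[0]] += 1
        for i in range(1, length):
            trans_counts[i - 1][s[i - 1]][s[i]] += 1

    # Normalise counts into probabilities.
    total = sum(init_counts.values())
    initial = {a: n / total for a, n in init_counts.items()}
    transitions = []
    for pos in trans_counts:
        table = {}
        for p, row in pos.items():
            row_total = sum(row.values())
            table[p] = {c: n / row_total for c, n in row.items()}
        transitions.append(table)
    return initial, transitions

def markov_features(seq, model):
    """Encode a sequence as one probabilistic parameter per position."""
    initial, transitions = model
    feats = [initial[seq[0]]]
    for i in range(1, len(seq)):
        feats.append(transitions[i - 1][seq[i - 1]][seq[i]])
    return feats

# Hypothetical toy windows around a donor site (not data from the paper).
train = ["AGGTA", "AGGTG", "CGGTA"]
model = fit_first_order_markov(train)
features = markov_features("AGGTA", model)
print(features)
```

In the second stage, feature vectors of this kind would be passed to an SVM with a polynomial kernel, for example scikit-learn's `SVC(kernel="poly")`; that library choice is an assumption here, not something the abstract specifies.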
Pages: 15