Accurate splice site prediction using support vector machines

Cited by: 136
Authors
Sonnenburg, Soeren [2]
Schweikert, Gabriele [1, 3, 4]
Philips, Petra [1]
Behr, Jonas [1]
Raetsch, Gunnar [1]
Affiliations
[1] Max Planck Gesell, Friedrich Miescher Lab, D-72076 Tubingen, Germany
[2] Fraunhofer Inst FIRST, D-12489 Berlin, Germany
[3] Max Planck Inst Biol Cybernet, D-72076 Tubingen, Germany
[4] Max Planck Inst Dev Biol, D-72076 Tubingen, Germany
DOI
10.1186/1471-2105-8-S10-S7
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Background: For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks.
Results: In this work we consider Support Vector Machines for splice site recognition. We employ the so-called weighted degree kernel which turns out well suited for this task, as we will illustrate in several experiments where we compare its prediction accuracy with that of recently proposed systems. We apply our method to the genome-wide recognition of splice sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, and Homo sapiens. Our performance estimates indicate that splice sites can be recognized very accurately in these genomes and that our method outperforms many other methods including Markov Chains, GeneSplicer and SpliceMachine. We provide genome-wide predictions of splice sites and a stand-alone prediction tool ready to be used for incorporation in a gene finder.
Availability: Data, splits, additional information on the model selection, the whole genome predictions, as well as the stand-alone prediction tool are available for download at http://www.fml.mpg.de/raetsch/projects/splice.
Pages: 16