Identification of protein coding regions in RNA transcripts

Cited by: 319
Authors
Tang, Shiyuyun [1]
Lomsadze, Alexandre [2]
Borodovsky, Mark [2, 3, 4, 5]
Affiliations
[1] Georgia Inst Technol, Sch Biol, Atlanta, GA 30332 USA
[2] Georgia Inst Technol, Joint Georgia Tech & Emory Wallace H Coulter Dept, Atlanta, GA 30332 USA
[3] Georgia Inst Technol, Sch Computat Sci & Engn, Atlanta, GA 30332 USA
[4] Georgia Inst Technol, Ctr Bioinformat & Computat Genom, Atlanta, GA 30332 USA
[5] Moscow Inst Phys & Technol, Dept Biol & Med Phys, Moscow, Russia
Funding
National Institutes of Health (USA)
Keywords
TRANSLATION; GENOMES; SEQ; RECONSTRUCTION; GENERATION; PREDICTION; SEQUENCES; CELLS;
DOI
10.1093/nar/gkv227
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification codes
071010 ; 081704 ;
Abstract
Massively parallel sequencing of RNA transcripts by next-generation technology (RNA-Seq) generates critically important data for eukaryotic gene discovery. Gene finding in transcripts can be done by statistical (alignment-free) as well as by alignment-based methods. We describe a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions in RNA transcripts. The algorithm parameters are estimated by unsupervised training, which makes manually curated training sets unnecessary. We demonstrate that (i) the unsupervised training is robust to transcript assembly errors and (ii) the accuracy of GeneMarkS-T in identifying protein-coding regions and, particularly, in predicting translation initiation sites in modelled as well as in assembled transcripts compares favourably to that of other existing methods.
Pages: 10