Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm

Cited by: 368
Authors
Lomsadze, Alexandre [1, 2]
Burns, Paul D. [1, 2]
Borodovsky, Mark [1, 2, 3, 4]
Affiliations
[1] Joint Georgia Tech, Atlanta, GA 30332 USA
[2] Emory Wallace H Coulter Dept Biomed Engn, Atlanta, GA 30332 USA
[3] Georgia Tech, Sch Computat Sci & Engn, Atlanta, GA 30332 USA
[4] Moscow Inst Phys & Technol, Dept Bioinformat, Moscow 141700, Russia
Funding
US National Institutes of Health (NIH);
Keywords
GENOME SEQUENCE; PREDICTION; DROSOPHILA; DATABASE; VECTOR;
DOI
10.1093/nar/gku557
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
We present a new approach to automatic training of a eukaryotic ab initio gene finding algorithm. With the advent of Next-Generation Sequencing, automatic training has become paramount, allowing genome annotation pipelines to keep pace with the speed of genome sequencing. Earlier we developed GeneMark-ES, currently the only gene finding algorithm for eukaryotic genomes that performs automatic training in unsupervised ab initio mode. The new algorithm, GeneMark-ET, augments GeneMark-ES with a novel method that integrates RNA-Seq read alignments into the self-training procedure. Use of 'assembled' RNA-Seq transcripts is far from trivial; a significant error rate of assembly was revealed in recent assessments. We demonstrated in computational experiments that the proposed method of incorporating 'unassembled' RNA-Seq reads improves the accuracy of gene prediction; in particular, for the 1.3 Gb genome of Aedes aegypti, the mean value of prediction Sensitivity and Specificity at the gene level increased over GeneMark-ES by 24.5%. In the current surge of genomic data, when the need for accurate sequence annotation is higher than ever, GeneMark-ET will be a valuable addition to the narrow arsenal of automatic gene prediction tools.
Pages: 8
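The abstract reports accuracy as the mean of Sensitivity and Specificity at the gene level but does not define these terms. The sketch below assumes the convention common in gene finding evaluation: a predicted gene counts as correct only if its full set of coding exons exactly matches an annotated gene, Sensitivity is the fraction of annotated genes recovered, and Specificity is the fraction of predicted genes that are correct. The data layout, the helper names `make_gene` and `gene_level_accuracy`, and the toy coordinates are all hypothetical and are not taken from the paper.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical representation: an exon is (start, end); a gene is
# (sequence id, strand, frozenset of coding exons).
Exon = Tuple[int, int]
Gene = Tuple[str, str, FrozenSet[Exon]]


def make_gene(seq_id: str, strand: str, exons: Iterable[Exon]) -> Gene:
    """Build a hashable gene record so exact matches can be found by set intersection."""
    return (seq_id, strand, frozenset(exons))


def gene_level_accuracy(predicted: Set[Gene], annotated: Set[Gene]) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, mean of the two) under exact gene matching.

    Assumed definitions (not stated in the abstract):
      sensitivity = correct predictions / annotated genes
      specificity = correct predictions / predicted genes
    """
    correct = len(predicted & annotated)
    sensitivity = correct / len(annotated) if annotated else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2.0


# Toy data: four annotated genes, three predictions, two of them exact matches.
annotated = {
    make_gene("chr1", "+", [(100, 200), (300, 400)]),
    make_gene("chr1", "-", [(1000, 1500)]),
    make_gene("chr2", "+", [(50, 120), (220, 310), (400, 480)]),
    make_gene("chr2", "-", [(900, 990)]),
}
predicted = {
    make_gene("chr1", "+", [(100, 200), (300, 400)]),   # exact match
    make_gene("chr1", "-", [(1000, 1500)]),             # exact match
    make_gene("chr2", "+", [(50, 120), (220, 315)]),    # wrong exon boundary, missing exon
}

sn, sp, mean_acc = gene_level_accuracy(predicted, annotated)
print(f"Sn={sn:.3f} Sp={sp:.3f} mean={mean_acc:.3f}")
```

Under this convention a single misplaced exon boundary makes the whole gene count as wrong, which is why gene-level figures are typically much lower than exon-level or nucleotide-level ones.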