Seqping: gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data

Cited by: 31
Authors
Chan, Kuang-Lim [1, 5]
Rosli, Rozana [1]
Tatarinova, Tatiana V. [2, 3]
Hogan, Michael [4]
Firdaus-Raih, Mohd [5]
Low, Eng-Ti Leslie [1]
Affiliations
[1] Malaysian Palm Oil Board, Adv Biotechnol & Breeding Ctr, 6 Persiaran Inst, Kajang 43000, Selangor, Malaysia
[2] Univ Southern Calif, Ctr Personalized Med, Los Angeles, CA USA
[3] Univ Southern Calif, Spatial Sci Inst, Los Angeles, CA USA
[4] Orion Genom, 4041 Forest Pk Ave, St Louis, MO 63108 USA
[5] Univ Kebangsaan Malaysia, Fac Sci & Technol, Bangi 43600, Selangor, Malaysia
Funding
National Science Foundation (USA);
Keywords
Gene prediction; Gene model; Species specific HMM; HIDDEN MARKOV MODEL; REPETITIVE SEQUENCES; IDENTIFICATION; ANNOTATION; FEATURES; ELEMENTS; BIOLOGY; MAKER;
DOI
10.1186/s12859-016-1426-6
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: Gene prediction is one of the most important steps in the genome annotation process. A large number of software tools and pipelines, developed using various computing techniques, are available for gene prediction. However, these systems have yet to accurately predict all or even most of the protein-coding regions. Furthermore, none of the currently available gene-finders has a universal Hidden Markov Model (HMM) that can perform gene prediction for all organisms equally well in an automatic fashion. Results: We present an automated gene prediction pipeline, Seqping, which uses self-training HMM models and transcriptomic data. The pipeline processes the genome and transcriptome sequences of the target species using the GlimmerHMM, SNAP, and AUGUSTUS pipelines, followed by the MAKER2 program to combine predictions from the three tools in association with the transcriptomic evidence. Seqping generates species-specific HMMs that are able to offer unbiased gene predictions. The pipeline was evaluated using the Oryza sativa and Arabidopsis thaliana genomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the pipeline was able to identify at least 95% of BUSCO's plantae dataset. Our evaluation shows that Seqping was able to generate better gene predictions than three HMM-based programs (MAKER2, GlimmerHMM and AUGUSTUS) using their respective available HMMs. Seqping had the highest accuracy in rice (0.5648 for CDS, 0.4468 for exon, and 0.6695 for nucleotide structure) and A. thaliana (0.5808 for CDS, 0.5955 for exon, and 0.8839 for nucleotide structure). Conclusions: Seqping provides researchers with a seamless pipeline to train species-specific HMMs and predict genes in newly sequenced or less-studied genomes. We conclude that the Seqping pipeline predictions are more accurate than gene predictions using the other three approaches with the default or available HMMs.
Pages: 1 / 7
Page count: 7