Gene prediction with a hidden Markov model and a new intron submodel

Cited by: 1114
Authors
Stanke, Mario [1]
Waack, Stephan [2]
Affiliations
[1] Univ Gottingen, Inst Mikrobiol & Genet, Abt Bioinformat, D-37077 Gottingen, Germany
[2] Univ Gottingen, Inst Numer & Angew Math, D-37083 Gottingen, Germany
DOI
10.1093/bioinformatics/btg1080
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Motivation: The problem of finding the genes in eukaryotic DNA sequences by computational methods is still not satisfactorily solved. Gene finding programs have achieved relatively high accuracy on short genomic sequences but do not perform well on longer sequences containing an unknown number of genes. On such sequences, existing programs tend to predict many false exons.
Results: We have developed a new program, AUGUSTUS, for the ab initio prediction of protein-coding genes in eukaryotic genomes. The program is based on a Hidden Markov Model and integrates a number of known methods and submodels. It employs a new way of modeling intron lengths. We use a new donor splice site model and a new model for a short region directly upstream of the donor splice site model that takes the reading frame into account, and we apply a method that allows better GC-content-dependent parameter estimation. On longer sequences, AUGUSTUS accurately predicts far more human and Drosophila genes than the ab initio gene prediction programs we compared it with, while at the same time being more specific.
Availability: A web interface for AUGUSTUS and the executable program are located at http://augustus.gobics.de.
Supplementary Information: The datasets used for testing and training are available at http://augustus.gobics.de/datasets/
Contact: mstanke@gwdg.de
Pages: II215-II225
Number of pages: 11