Gene structure prediction by spliced alignment of genomic DNA with protein sequences: Increased accuracy by differential splice site scoring

Cited by: 41
Authors
Usuka, J
Brendel, V
Affiliations
[1] Iowa State Univ Sci & Technol, Dept Zool & Genet, Ames, IA 50011 USA
[2] Stanford Univ, Dept Chem, Stanford, CA 94305 USA
Funding
National Science Foundation (USA)
Keywords
target protein; intron; spliced alignment; dynamic programming; hidden Markov model
DOI
10.1006/jmbi.2000.3641
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification codes
071010; 081704
Abstract
Gene identification in genomic DNA from eukaryotes is complicated by the vast combinatorial possibilities of potential exon assemblies. If the gene encodes a protein that is closely related to known proteins, gene identification is aided by matching similarity of potential translation products to those target proteins. The genomic DNA and protein sequences can be aligned directly by scoring the implied residues of in-frame nucleotide triplets against the protein residues in conventional ways, while allowing for long gaps in the alignment corresponding to introns in the genomic DNA. We describe a novel method for such spliced alignment. The method derives an optimal alignment based on scoring for both sequence similarity of the predicted gene product to the protein sequence and intrinsic splice site strength of the predicted introns. Application of the method to a representative set of 50 known genes from Arabidopsis thaliana showed significant improvement in prediction accuracy compared to previous spliced alignment methods. The method is also more accurate than ab initio gene prediction methods, provided sufficiently close target proteins are available. In view of the fast growth of public sequence repositories, we argue that close targets will be available for the majority of novel genes, making spliced alignment an excellent practical tool for high-throughput automated genome annotation. © 2000 Academic Press.
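The idea of scoring in-frame codons against protein residues while allowing intron-length gaps can be illustrated with a small dynamic program. The following is a toy sketch under loud simplifying assumptions, not the authors' algorithm: the function names (`spliced_align`, `donor_score`, `acceptor_score`) and all score values are hypothetical, residues are scored by plain identity rather than a substitution matrix, introns may only fall between complete codons, no alignment gaps are allowed, and splice sites are scored by a binary GT/AG check that stands in for the paper's splice site strength model.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Toy scores (assumptions for illustration, not values from the paper).
MATCH, MISMATCH = 2.0, -1.0
SITE_OK, SITE_BAD = -1.0, -8.0
NEG = float("-inf")

def donor_score(g, k):
    """Score for an intron whose first nucleotide is g[k]."""
    return SITE_OK if g[k:k + 2] == "GT" else SITE_BAD

def acceptor_score(g, k):
    """Score for an intron whose last nucleotide is g[k - 1]."""
    return SITE_OK if k >= 2 and g[k - 2:k] == "AG" else SITE_BAD

def spliced_align(g, p):
    """Globally align genomic DNA g to protein p; return (score, exons).

    E[i][j]: best score with g[:i] consumed, p[:j] matched, in an exon.
    I[i][j]: same, but with g[i - 1] lying inside an intron.
    Exons are reported as half-open (start, end) genomic intervals.
    """
    n, m = len(g), len(p)
    E = [[NEG] * (m + 1) for _ in range(n + 1)]
    I = [[NEG] * (m + 1) for _ in range(n + 1)]
    back_e = [[None] * (m + 1) for _ in range(n + 1)]
    back_i = [[None] * (m + 1) for _ in range(n + 1)]
    E[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(m + 1):
            # Extend an open intron, or open one at a donor site.
            cands = [(I[i - 1][j], ("I", i - 1, j)),
                     (E[i - 1][j] + donor_score(g, i - 1), ("E", i - 1, j))]
            I[i][j], back_i[i][j] = max(cands, key=lambda c: c[0])
            # Close the intron at an acceptor site, or match one codon.
            cands = [(I[i][j] + acceptor_score(g, i), ("I", i, j))]
            if i >= 3 and j >= 1:
                same = CODON.get(g[i - 3:i]) == p[j - 1]
                cands.append((E[i - 3][j - 1] + (MATCH if same else MISMATCH),
                              ("E", i - 3, j - 1)))
            E[i][j], back_e[i][j] = max(cands, key=lambda c: c[0])
    if E[n][m] == NEG:
        raise ValueError("no spliced alignment exists for these inputs")
    # Trace back from the end, collecting exon boundaries.
    exons, end = [], n
    state, i, j = "E", n, m
    while not (state == "E" and i == 0 and j == 0):
        if state == "E":
            prev = back_e[i][j]
            if prev[0] == "I":      # the exon to the right starts at i
                exons.append((i, end))
        else:
            prev = back_i[i][j]
            if prev[0] == "E":      # the upstream exon ends where the intron begins
                end = prev[1]
        state, i, j = prev
    exons.append((0, end))
    return E[n][m], exons[::-1]
```

For example, `spliced_align("ATGAAAGTAAGTTTTCAGGTT", "MKV")` aligns the codons ATG, AAA and GTT to M, K and V and treats the GT...AG stretch between them as a single intron. A realistic implementation would also need introns in all three codon phases, alignment gaps, a minimum intron length, and graded splice site scores.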
Pages: 1075-1085
Number of pages: 11