Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genomic locus

Cited by: 73
Authors
Brendel, V
Xing, LQ
Zhu, W
Affiliations
[1] Iowa State Univ, Dept Genet Dev & Cell Biol, Ames, IA 50011 USA
[2] Iowa State Univ, Dept Stat, Ames, IA 50011 USA
Funding
National Science Foundation (USA)
DOI
10.1093/bioinformatics/bth058
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Accurate gene structure annotation is a challenging computational problem in genomics. The best results are achieved with spliced alignment of full-length cDNAs or multiple expressed sequence tags (ESTs) with sufficient overlap to cover the entire gene. For most species, cDNA and EST collections are far from comprehensive. We sought to overcome this bottleneck by exploring the possibility of using combined EST resources from fairly diverged species that still share a common gene space. Previous spliced alignment tools were found inadequate for this task because they rely on very high sequence similarity between the ESTs and the genomic DNA.

Results: We have developed a computer program, GeneSeqer, which is capable of aligning thousands of ESTs with a long genomic sequence in a reasonable amount of time. The algorithm is uniquely designed to tolerate a high percentage of mismatches and insertions or deletions in the EST relative to the genomic template. This feature allows use of non-cognate ESTs for gene structure prediction, including ESTs derived from duplicated genes and homologous genes from related species. The increased gene prediction sensitivity results in part from novel splice site prediction models that are also available as a stand-alone splice site prediction tool. We assessed GeneSeqer performance relative to a standard Arabidopsis thaliana gene set and demonstrate its utility for plant genome annotation. In particular, we propose that this method provides a timely tool for the annotation of the rice genome, using abundant ESTs from other cereals and plants.
Pages: 1157-1169
Number of pages: 13