Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genomic locus

Cited by: 73

Authors:

Brendel, V

Xing, LQ

Zhu, W

Affiliations:

[1] Iowa State Univ, Dept Genet Dev & Cell Biol, Ames, IA 50011 USA

[2] Iowa State Univ, Dept Stat, Ames, IA 50011 USA

Source:

BIOINFORMATICS | 2004 / Vol. 20 / No. 7

Funding:

National Science Foundation (USA)

DOI:

10.1093/bioinformatics/bth058

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]

Subject classification codes:

071010; 081704

Abstract:

Motivation: Accurate gene structure annotation is a challenging computational problem in genomics. The best results are achieved with spliced alignment of full-length cDNAs or multiple expressed sequence tags (ESTs) with sufficient overlap to cover the entire gene. For most species, cDNA and EST collections are far from comprehensive. We sought to overcome this bottleneck by exploring the possibility of using combined EST resources from fairly diverged species that still share a common gene space. Previous spliced alignment tools were found inadequate for this task because they rely on very high sequence similarity between the ESTs and the genomic DNA.

Results: We have developed a computer program, GeneSeqer, which is capable of aligning thousands of ESTs with a long genomic sequence in a reasonable amount of time. The algorithm is uniquely designed to tolerate a high percentage of mismatches and insertions or deletions in the EST relative to the genomic template. This feature allows use of non-cognate ESTs for gene structure prediction, including ESTs derived from duplicated genes and homologous genes from related species. The increased gene prediction sensitivity results in part from novel splice site prediction models that are also available as a stand-alone splice site prediction tool. We assessed GeneSeqer performance relative to a standard Arabidopsis thaliana gene set and demonstrate its utility for plant genome annotation. In particular, we propose that this method provides a timely tool for the annotation of the rice genome, using abundant ESTs from other cereals and plants.

Pages: 1157-1169

Number of pages: 13
