Automatic annotation of eukaryotic genes, pseudogenes and promoters

Cited by: 578
Authors
Solovyev, Victor [1]
Kosarev, Peter
Seledsov, Igor
Vorobyev, Denis
Affiliations
[1] Univ London Royal Holloway & Bedford New Coll, Dept Comp Sci, Egham TW20 0EX, Surrey, England
[2] Softberry Inc, Mt Kisco, NY 10549 USA
DOI
10.1186/gb-2006-7-s1-s10
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Subject classification codes
071005; 0836; 090102; 100705
Abstract
Background: The ENCODE gene prediction workshop (EGASP) has been organized to evaluate how well state-of-the-art automatic gene finding methods are able to reproduce the manual and experimental gene annotation of the human genome. We have used Softberry gene finding software to predict genes, pseudogenes and promoters in 44 selected ENCODE sequences representing approximately 1% (30 Mb) of the human genome. Predictions of gene finding programs were evaluated in terms of their ability to reproduce the ENCODE-HAVANA annotation.
Results: The Fgenesh++ gene prediction pipeline can identify 91% of coding nucleotides with a specificity of 90%. Our automatic pseudogene finder (PSF program) found 90% of the manually annotated pseudogenes and some new ones. The Fprom promoter prediction program identifies 80% of TATA promoter sequences with one false positive prediction per 2,000 base pairs (bp) and 50% of TATA-less promoters with one false positive prediction per 650 bp. It can be used to identify transcription start sites upstream of annotated coding parts of genes found by gene prediction software.
Conclusions: We review our software and underlying methods for identifying these three important structural and functional genome components and discuss the accuracy of predictions, recent advances and open problems in annotating genomic sequences. We have demonstrated that our methods can be effectively used for initial automatic annotation of the eukaryotic genome.
Pages: 12