GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions

Cited by: 1765
Authors
Besemer, J
Lomsadze, A
Borodovsky, M [1]
Affiliations
[1] Georgia Inst Technol, Sch Biol, Atlanta, GA 30332 USA
[2] Georgia Inst Technol, Sch Math, Atlanta, GA 30332 USA
[3] Gene Probe Inc, Atlanta, GA 30033 USA
DOI
10.1093/nar/29.12.2607
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
Improving the accuracy of prediction of gene starts is one of a few remaining open problems in computer prediction of prokaryotic genes. Its difficulty is caused by the absence of relatively strong sequence patterns identifying true translation initiation sites. In the current paper we show that the accuracy of gene start prediction can be improved by combining models of protein-coding and non-coding regions and models of regulatory sites near gene start within an iterative Hidden Markov model based algorithm. The new gene prediction method, called GeneMarkS, utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. The GeneMarkS implementation uses an improved version of the gene finding program GeneMark.hmm, heuristic Markov models of coding and non-coding regions and the Gibbs sampling multiple alignment program. GeneMarkS predicted precisely 83.2% of the translation starts of GenBank annotated Bacillus subtilis genes and 94.4% of translation starts in an experimentally validated set of Escherichia coli genes. We have also observed that GeneMarkS detects prokaryotic genes, in terms of identifying open reading frames containing real genes, with an accuracy matching the level of the best currently used gene detection methods. Accurate translation start prediction, in addition to the refinement of protein sequence N-terminal data, provides the benefit of precise positioning of the sequence region situated upstream to a gene start. Therefore, sequence motifs related to transcription and translation regulatory sites can be revealed and analyzed with higher precision. These motifs were shown to possess a significant variability, the functional and evolutionary connections of which are discussed.
Pages: 2607-2618
Page count: 12