Analysis of canonical and non-canonical splice sites in mammalian genomes

Cited by: 449
Authors
Burset, M [1 ]
Seledtsov, IA [1 ]
Solovyev, VV [1 ]
Affiliations
[1] Sanger Ctr, Informat Div, Cambridge CB10 1SA, England
Funding
Wellcome Trust (UK)
DOI
10.1093/nar/28.21.4364
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
A set of 43 337 splice junction pairs was extracted from mammalian GenBank annotated genes. Expressed sequence tag (EST) sequences support 22 489 of them. Of these, 98.71% contain canonical dinucleotides: GT and AG for donor and acceptor sites, respectively; 0.56% hold non-canonical GC-AG splice site pairs; and the remaining 0.73% occur in many small groups (with a maximum size of 0.05%). Studying these groups, we observe that many of them contain splicing dinucleotides shifted from the annotated splice junction by one position. After close examination of such cases, we present a new classification consisting of only eight observed types of splice site pairs (out of 256 a priori possible combinations). EST alignments allow us to verify the exonic part of the splice sites, but many non-canonical cases may be due to intron sequencing errors. This idea is given substantial support when we compare the sequences of human genes having non-canonical splice sites deposited in GenBank with those deposited by high throughput genome sequencing projects (HTG). A high proportion (156 out of 171) of the human non-canonical and EST-supported splice site sequences had a clear match in the human HTG. After corrections, they can be classified as: 79 GC-AG pairs (of which one was an error that corrected to GC-AG), 61 errors that were corrected to GT-AG canonical pairs, six AT-AC pairs (of which two were errors that corrected to AT-AC), one case produced from a non-existent intron, seven cases found in HTG that were deposited to GenBank, and finally only two remaining cases of supported non-canonical splice sites. If we assume that approximately the same situation holds for the whole set of annotated mammalian non-canonical splice sites, then 99.24% of splice site pairs should be GT-AG, 0.69% GC-AG, 0.05% AT-AC and finally only 0.02% could consist of other types of non-canonical splice sites.
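The grouping by terminal dinucleotides described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the helper names (splice_pair, tally_pairs) and the toy intron sequences are hypothetical, and it assumes each intron is given as a plain string running from its donor end to its acceptor end. It also enumerates the 4^4 = 256 a priori donor/acceptor combinations mentioned in the abstract.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

# All a priori donor/acceptor dinucleotide combinations: 4**4 = 256.
ALL_PAIRS = [
    "".join(donor) + "-" + "".join(acceptor)
    for donor in product(BASES, repeat=2)
    for acceptor in product(BASES, repeat=2)
]


def splice_pair(intron: str) -> str:
    """Return the terminal dinucleotide pair of an intron, e.g. 'GT-AG'."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to have two terminal dinucleotides")
    return f"{seq[:2]}-{seq[-2:]}"


def tally_pairs(introns):
    """Map each observed pair type to (count, percentage of all introns)."""
    counts = Counter(splice_pair(s) for s in introns)
    total = sum(counts.values())
    return {pair: (n, 100.0 * n / total) for pair, n in counts.items()}


# Toy introns, made up for illustration (not taken from the paper's data set).
toy_introns = [
    "GTAAGTCCTTTTCAG",
    "GTGAGTTTTCCCTAG",
    "GCAAGTCTTTTGCAG",
    "ATATCCTTTTCCAC",
]
summary = tally_pairs(toy_introns)
```

On real data the same tally would be run only on EST-supported junctions, since the abstract stresses that many apparent non-canonical pairs are annotation or sequencing errors.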
We analyze several characteristics of EST-verified splice sites and build weight matrices for the major groups, which can be incorporated into gene prediction programs. We also present a set of EST-verified canonical splice sites larger by two orders of magnitude than the current one (22 199 entries versus ~600) and, finally, a set of 290 EST-supported non-canonical splice sites. Both sets should be significant for future investigations of the splicing mechanism.
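As a rough illustration of what a splice site weight matrix is, the sketch below builds a log-odds position weight matrix from aligned, equal-length site sequences. The abstract does not specify how the authors constructed theirs, so everything here is an assumption: the uniform background of 0.25, the pseudocount of 1, the function names, and the toy donor sites (3 exonic plus 6 intronic bases) are all hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from aligned sites of equal length."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        counts = Counter(s[pos].upper() for s in sites)
        matrix.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def score(matrix, candidate):
    """Sum per-position log-odds; bases outside ACGT contribute 0."""
    return sum(col.get(base, 0.0) for col, base in zip(matrix, candidate.upper()))


# Toy donor sites, made up for illustration (not from the paper's data set).
toy_donors = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]
pwm = weight_matrix(toy_donors)
consensus_score = score(pwm, "CAGGTAAGT")
mutant_score = score(pwm, "CAGCCAAGT")
```

A candidate carrying the GT dinucleotide at the intron start scores higher than one with that dinucleotide mutated, which is the property a gene prediction program would exploit. Separate matrices would be needed for each major group (for example GT-AG and GC-AG).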
Pages: 4364-4375
Page count: 12