Whole genome shotgun sequencing of Brassica oleracea and its application to gene discovery and annotation in Arabidopsis

Cited by: 53
Authors
Ayele, M [1]
Haas, BJ [1]
Kumar, N [1]
Wu, H [1]
Xiao, YL [1]
Van Aken, S [1]
Utterback, TR [1]
Wortman, JR [1]
White, OR [1]
Town, CD [1]
Affiliations
[1] Inst Genome Res, Rockville, MD 20850 USA
DOI
10.1101/gr.3176505
Chinese Library Classification (CLC)
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Through comparative studies of the model organism Arabidopsis thaliana and its close relative Brassica oleracea, we have identified conserved regions that represent potentially functional sequences overlooked by previous Arabidopsis genome annotation methods. A total of 454,274 whole genome shotgun sequences covering 283 Mb (0.44x) of the estimated 650 Mb Brassica genome were searched against the Arabidopsis genome, and conserved Arabidopsis genome sequences (CAGSs) were identified. Of these 229,735 conserved regions, 167,357 fell within or intersected existing gene models, while 60,378 were located in previously unannotated regions. After removal of sequences matching known proteins, CAGSs that were close to one another were chained together as potentially comprising portions of the same functional unit. This resulted in 27,347 chains of which 15,686 were sufficiently distant from existing gene annotations to be considered a novel conserved unit. Of 192 conserved regions examined, 58 were found to be expressed in our cDNA populations. Rapid amplification of cDNA ends (RACE) was used to obtain potentially full-length transcripts from these 58 regions. The resulting sequences led to the creation of 21 gene models at 17 new Arabidopsis loci and the addition of splice variants or updates to another 19 gene structures. In addition, CAGSs overlapping already annotated genes in Arabidopsis can provide guidance for manual improvement of existing gene models. Published genome-wide expression data based on whole genome tiling arrays and massively parallel signature sequencing were overlaid on the Brassica-Arabidopsis conserved sequences, and 1399 regions of intersection were identified. Collectively our results and these data sets suggest that several thousand new Arabidopsis genes remain to be identified and annotated.
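The chaining step described in the abstract (grouping conserved regions that lie close to one another into candidate functional units) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the distance threshold or any other chaining criteria, so the function name `chain_regions`, the parameter `max_gap`, and the example coordinates are all hypothetical.

```python
from typing import List, Tuple

# (start, end) coordinates of a conserved region on a single chromosome.
Interval = Tuple[int, int]


def chain_regions(regions: List[Interval], max_gap: int) -> List[List[Interval]]:
    """Group regions into chains.

    A region joins the current chain when its start lies within
    max_gap bases of the furthest end seen so far in that chain.
    max_gap is an illustrative parameter, not a value from the paper.
    """
    chains: List[List[Interval]] = []
    chain_end = 0  # furthest end coordinate of the chain being built
    for start, end in sorted(regions):
        if chains and start - chain_end <= max_gap:
            chains[-1].append((start, end))
            chain_end = max(chain_end, end)
        else:
            chains.append([(start, end)])
            chain_end = end
    return chains


# Hypothetical coordinates, chosen only to show the grouping behaviour.
example = [(100, 180), (250, 320), (5000, 5090), (5150, 5230), (20000, 20100)]
chains = chain_regions(example, max_gap=500)
for chain in chains:
    print(chain)
```

With these made-up inputs, the first two regions form one chain, the next two form a second, and the last region stands alone. A real analysis would also need to handle multiple chromosomes and strand, and would filter chains by distance from existing gene annotations, as the abstract describes.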
Pages: 487-495
Page count: 9