Prediction and Quantification of Splice Events from RNA-Seq Data

Cited by: 92
Authors
Goldstein, Leonard D. [1, 2]
Cao, Yi [1]
Pau, Gregoire [1]
Lawrence, Michael [1]
Wu, Thomas D. [1]
Seshagiri, Somasekar [2]
Gentleman, Robert [1, 3]
Affiliations
[1] Genentech Inc, Dept Bioinformat & Computat Biol, San Francisco, CA 94080 USA
[2] Genentech Inc, Dept Mol Biol, San Francisco, CA 94080 USA
[3] 23andMe Inc, Mountain View, CA USA
Source
PLOS ONE | 2016 / Vol. 11 / Issue 05
Keywords
TRANSCRIPTOME; ALIGNMENT; BROWSER
DOI
10.1371/journal.pone.0156132
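Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract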
Analysis of splice variants from short read RNA-seq data remains a challenging problem. Here we present a novel method for the genome-guided prediction and quantification of splice events from RNA-seq data, which enables the analysis of unannotated and complex splice events. Splice junctions and exons are predicted from reads mapped to a reference genome and are assembled into a genome-wide splice graph. Splice events are identified recursively from the graph and are quantified locally based on reads extending across the start or end of each splice variant. We assess prediction accuracy based on simulated and real RNA-seq data, and illustrate how different read aligners (GSNAP, HISAT2, STAR, TopHat2) affect prediction results. We validate our approach for quantification based on simulated data, and compare local estimates of relative splice variant usage with those from other methods (MISO, Cufflinks) based on simulated and real RNA-seq data. In a proof-of-concept study of splice variants in 16 normal human tissues (Illumina Body Map 2.0) we identify 249 internal exons that belong to known genes but are not related to annotated exons. Using independent RNA samples from 14 matched normal human tissues, we validate 9/9 of these exons by RT-PCR and 216/249 by paired-end RNA-seq (2 x 250 bp). These results indicate that de novo prediction of splice variants remains beneficial even in well-studied systems. An implementation of our method is freely available as an R/Bioconductor package SGSeq.
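The abstract describes identifying splice events from a splice graph and quantifying them locally from reads that extend across the start or end of each variant. The following is a minimal illustrative sketch of that idea, not the SGSeq implementation: the function names `enumerate_variants` and `relative_usage`, the toy graph, and the read counts are all hypothetical, and pooling start and end counts into one ratio is an assumption made here for simplicity.

```python
# Toy splice graph: nodes are exons, edges are splice junctions.
# All names and numbers below are hypothetical illustration data.
splice_graph = {
    "E1": ["E2", "E3"],  # E1 -> E2 (inclusion) and E1 -> E3 (skipping)
    "E2": ["E3"],
}


def enumerate_variants(graph, source, sink):
    """Return every path from source to sink in a directed acyclic graph.

    Each path is one splice variant of the event bounded by source and sink.
    """
    if source == sink:
        return [[sink]]
    paths = []
    for nxt in graph.get(source, []):
        for tail in enumerate_variants(graph, nxt, sink):
            paths.append([source] + tail)
    return paths


def relative_usage(boundary_counts):
    """Estimate relative usage of each variant from local read counts.

    boundary_counts maps a variant name to (reads at its start, reads at
    its end). Assumption: usage is the variant's share of all reads
    observed at the start and end of the event.
    """
    total = sum(start + end for start, end in boundary_counts.values())
    if total == 0:
        # No informative reads: usage is undefined for every variant.
        return {name: None for name in boundary_counts}
    return {
        name: (start + end) / total
        for name, (start, end) in boundary_counts.items()
    }


variants = enumerate_variants(splice_graph, "E1", "E3")
usage = relative_usage({"inclusion": (30, 30), "skipping": (20, 20)})
```

Here `variants` holds the two paths through the event (exon inclusion and exon skipping), and `usage` gives each variant's share of the reads that span the event boundaries. Because only reads at the event boundaries are used, the estimate does not depend on full-length transcript reconstruction.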
Pages: 18