Sequencing of RNAs (RNA-Seq) has revolutionized the field of transcriptomics, but the reads obtained often contain errors. Read error correction can have a large impact on our ability to accurately assemble transcripts. This is especially true for de novo transcriptome analysis, where a reference genome is not available. Current read error correction methods, developed for DNA sequence data, cannot handle the overlapping effects of non-uniform abundance, polymorphisms and alternative splicing. Here we present SEquencing Error CorrEction in Rna-seq data (SEECER), a hidden Markov model (HMM)-based method, which is the first to successfully address these problems. SEECER efficiently learns hundreds of thousands of HMMs and uses these to correct sequencing errors. Using human RNA-Seq data, we show that SEECER greatly improves on previous methods in terms of quality of read alignment to the genome and assembly accuracy. To illustrate the usefulness of SEECER for de novo transcriptome studies, we generated new RNA-Seq data to study the development of the sea cucumber Parastichopus parvimensis. Our corrected assembled transcripts shed new light on two important stages in sea cucumber development. Comparison of the assembled transcripts to known transcripts in other species also revealed novel transcripts that are unique to sea cucumber, some of which we have experimentally validated. Supporting website: http://sb.cs.cmu.edu/seecer/.