IDBA-tran: a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels

Cited by: 139
Authors
Peng, Yu [1]
Leung, Henry C. M. [1]
Yiu, Siu-Ming [1]
Lv, Ming-Ju [2]
Zhu, Xin-Guang [2]
Chin, Francis Y. L. [1]
Affiliations
[1] Univ Hong Kong, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
[2] Chinese Acad Sci, Shanghai Inst Biol Sci, CAS MPG Partner Inst Computat Biol, Shanghai 200031, Peoples R China
Keywords
RNA-SEQ DATA; SINGLE-CELL; ISOFORM EXPRESSION; SEQUENCING DATA; GENOME; REVEALS;
DOI
10.1093/bioinformatics/btt219
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
Motivation: RNA sequencing based on next-generation sequencing technology is effective for analyzing transcriptomes. Like de novo genome assembly, de novo transcriptome assembly does not rely on any reference genome or additional annotation information, but it is more difficult. In particular, isoforms can have very uneven expression levels (e.g. 1:100), which makes it very difficult to identify low-expressed isoforms. One challenge is to remove erroneous vertices/edges with high multiplicity (produced by high-expressed isoforms) in the de Bruijn graph without removing correct ones with not-so-high multiplicity from low-expressed isoforms. Failing to do so results either in the loss of low-expressed isoforms or in complicated subgraphs in which transcripts of different genes are mixed together because of erroneous vertices/edges.
Contributions: Unlike existing tools, which remove erroneous vertices/edges with multiplicities lower than a global threshold, we use a probabilistic progressive approach to iteratively remove them with local thresholds. This enables us to decompose the graph into disconnected components, each containing a few genes, if not a single gene, while retaining many correct vertices/edges of low-expressed isoforms. Combined with existing techniques, IDBA-Tran is able to assemble both high-expressed and low-expressed transcripts and outperforms existing assemblers in terms of sensitivity and specificity for both simulated and real data.
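Illustration (not taken from the paper): the contrast between a global and a local multiplicity threshold can be sketched on a toy graph. The sketch below is a simplified assumption, not IDBA-Tran's actual probabilistic model. The function names (`prune_global`, `prune_local`, `prune_progressive`), the "ratio of the neighbourhood mean" rule, and the toy multiplicities are all hypothetical and chosen only to show why no single global cutoff can separate an error edge attached to a high-expressed gene (multiplicity 20) from correct edges of a low-expressed gene (multiplicity 10).

```python
from collections import defaultdict

def prune_global(edges, threshold):
    """Keep edges whose multiplicity reaches one global threshold."""
    return {e: m for e, m in edges.items() if m >= threshold}

def prune_local(edges, ratio):
    """Keep edges whose multiplicity reaches ratio * local mean.

    The local mean is taken over the multiplicities of all edges incident
    to either endpoint (the edge itself is counted once per endpoint).
    This rule is an illustrative stand-in for a local threshold.
    """
    incident = defaultdict(list)
    for (u, v), m in edges.items():
        incident[u].append(m)
        incident[v].append(m)
    kept = {}
    for (u, v), m in edges.items():
        neighbourhood = incident[u] + incident[v]
        local_mean = sum(neighbourhood) / len(neighbourhood)
        if m >= ratio * local_mean:
            kept[(u, v)] = m
    return kept

def prune_progressive(edges, ratios):
    """Apply the local rule repeatedly with a sequence of ratios."""
    for ratio in ratios:
        edges = prune_local(edges, ratio)
    return edges

# Toy graph (hypothetical numbers): a high-expressed gene A-B-C with an
# erroneous branch B-X, and a low-expressed gene P-Q-R.
edges = {
    ("A", "B"): 1000,
    ("B", "C"): 1000,
    ("B", "X"): 20,   # error edge produced by the high-expressed gene
    ("P", "Q"): 10,   # correct edges of the low-expressed gene
    ("Q", "R"): 10,
}

global_result = prune_global(edges, 25)
local_result = prune_progressive(edges, [0.05, 0.1])
print(sorted(global_result))
print(sorted(local_result))
```

In this toy case a global cutoff of 25 removes the error edge but also deletes the whole low-expressed gene, while any cutoff of 20 or below keeps the error edge. The local rule removes only B-X, leaving two disconnected components, one per gene.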
Pages: 326-334
Number of pages: 9
Related papers
21 in total
[1]   SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing [J].
Bankevich, Anton ;
Nurk, Sergey ;
Antipov, Dmitry ;
Gurevich, Alexey A. ;
Dvorkin, Mikhail ;
Kulikov, Alexander S. ;
Lesin, Valery M. ;
Nikolenko, Sergey I. ;
Son Pham ;
Prjibelski, Andrey D. ;
Pyshkin, Alexey V. ;
Sirotkin, Alexander V. ;
Vyahhi, Nikolay ;
Tesler, Glenn ;
Alekseyev, Max A. ;
Pevzner, Pavel A. .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2012, 19 (05) :455-477
[2]   Efficient de novo assembly of single-cell bacterial genomes from short-read data sets [J].
Chitsaz, Hamidreza ;
Yee-Greenbaum, Joyclyn L. ;
Tesler, Glenn ;
Lombardo, Mary-Jane ;
Dupont, Christopher L. ;
Badger, Jonathan H. ;
Novotny, Mark ;
Rusch, Douglas B. ;
Fraser, Louise J. ;
Gormley, Niall A. ;
Schulz-Trieglaff, Ole ;
Smith, Geoffrey P. ;
Evers, Dirk J. ;
Pevzner, Pavel A. ;
Lasken, Roger S. .
NATURE BIOTECHNOLOGY, 2011, 29 (10) :915-U214
[3]   Full-length transcriptome assembly from RNA-Seq data without a reference genome [J].
Grabherr, Manfred G. ;
Haas, Brian J. ;
Yassour, Moran ;
Levin, Joshua Z. ;
Thompson, Dawn A. ;
Amit, Ido ;
Adiconis, Xian ;
Fan, Lin ;
Raychowdhury, Raktima ;
Zeng, Qiandong ;
Chen, Zehua ;
Mauceli, Evan ;
Hacohen, Nir ;
Gnirke, Andreas ;
Rhind, Nicholas ;
di Palma, Federica ;
Birren, Bruce W. ;
Nusbaum, Chad ;
Lindblad-Toh, Kerstin ;
Friedman, Nir ;
Regev, Aviv .
NATURE BIOTECHNOLOGY, 2011, 29 (07) :644-U130
[4]   Molecular biology - Power sequencing [J].
Graveley, Brenton R. .
NATURE, 2008, 453 (7199) :1197-1198
[5]   Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs [J].
Guttman, Mitchell ;
Garber, Manuel ;
Levin, Joshua Z. ;
Donaghey, Julie ;
Robinson, James ;
Adiconis, Xian ;
Fan, Lin ;
Koziol, Magdalena J. ;
Gnirke, Andreas ;
Nusbaum, Chad ;
Rinn, John L. ;
Lander, Eric S. ;
Regev, Aviv .
NATURE BIOTECHNOLOGY, 2010, 28 (05) :503-U166
[6]   Statistical inferences for isoform expression in RNA-Seq [J].
Jiang, Hui ;
Wong, Wing Hung .
BIOINFORMATICS, 2009, 25 (08) :1026-1032
[7]   Kent, W. J. .
GENOME RESEARCH, 2002, 12 :656, DOI 10.1101/gr.229202
[8]   The sequence and de novo assembly of the giant panda genome [J].
Li, Ruiqiang ;
Fan, Wei ;
Tian, Geng ;
Zhu, Hongmei ;
He, Lin ;
Cai, Jing ;
Huang, Quanfei ;
Cai, Qingle ;
Li, Bo ;
Bai, Yinqi ;
Zhang, Zhihe ;
Zhang, Yaping ;
Wang, Wen ;
Li, Jun ;
Wei, Fuwen ;
Li, Heng ;
Jian, Min ;
Li, Jianwen ;
Zhang, Zhaolei ;
Nielsen, Rasmus ;
Li, Dawei ;
Gu, Wanjun ;
Yang, Zhentao ;
Xuan, Zhaoling ;
Ryder, Oliver A. ;
Leung, Frederick Chi-Ching ;
Zhou, Yan ;
Cao, Jianjun ;
Sun, Xiao ;
Fu, Yonggui ;
Fang, Xiaodong ;
Guo, Xiaosen ;
Wang, Bo ;
Hou, Rong ;
Shen, Fujun ;
Mu, Bo ;
Ni, Peixiang ;
Lin, Runmao ;
Qian, Wubin ;
Wang, Guodong ;
Yu, Chang ;
Nie, Wenhui ;
Wang, Jinhuan ;
Wu, Zhigang ;
Liang, Huiqing ;
Min, Jiumeng ;
Wu, Qi ;
Cheng, Shifeng ;
Ruan, Jue ;
Wang, Mingwei .
NATURE, 2010, 463 (7279) :311-317
[9]   Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads [J].
Li, Wei ;
Jiang, Tao .
BIOINFORMATICS, 2012, 28 (22) :2914-2921
[10]   The transcriptional landscape of the yeast genome defined by RNA sequencing [J].
Nagalakshmi, Ugrappa ;
Wang, Zhong ;
Waern, Karl ;
Shou, Chong ;
Raha, Debasish ;
Gerstein, Mark ;
Snyder, Michael .
SCIENCE, 2008, 320 (5881) :1344-1349