PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme

Cited by: 582
Authors
Li, Aimin [1, 2]
Zhang, Junying [1]
Zhou, Zhongyin [3, 4]
Affiliations
[1] Xidian Univ, Sch Comp Sci & Technol, Xian, Peoples R China
[2] Xian Univ Technol, Sch Comp Sci & Engn, Xian, Peoples R China
[3] Univ Sci & Technol China, Sch Life Sci, Dept Mol & Cell Biol, Hefei, Peoples R China
[4] Chinese Acad Sci, Kunming Inst Zool, State Key Lab Genet Resources & Evolut, Kunming, Peoples R China
Funding
Specialized Research Fund for the Doctoral Program of Higher Education;
Keywords
RNA-seq; lncRNA; k-mer; Prediction; de novo sequencing; de novo assemble; REFERENCE SEQUENCES REFSEQ; GENOME ANNOTATION; TRANSCRIPTOME; REVEALS; IDENTIFICATION; EVOLUTION; FEATURES; GENCODE; PAIRS; SEQ;
DOI
10.1186/1471-2105-15-311
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: High-throughput transcriptome sequencing (RNA-seq) promises the discovery of novel protein-coding and non-coding transcripts, in particular the identification of long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that do not depend on prior gene annotations, genomic sequences, or high-quality sequencing.

Results: We present an alignment-free tool called PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme), which uses a computational pipeline based on an improved k-mer scheme and a support vector machine (SVM) algorithm to distinguish lncRNAs from messenger RNAs (mRNAs) in the absence of genomic sequences or annotations. The performance of PLEK was evaluated on well-annotated mRNA and lncRNA transcripts. Tenfold cross-validation on human RefSeq mRNAs and GENCODE lncRNAs indicated that the tool achieves an accuracy of up to 95.6%. We demonstrated the utility of PLEK on transcripts from other vertebrates using the model built from human datasets; PLEK attained >90% accuracy on most of these datasets. PLEK also performed well on a simulated dataset and on two real de novo assembled transcriptome datasets (sequenced on the PacBio and 454 platforms) with relatively high rates of indel sequencing errors. In addition, PLEK is approximately eightfold faster than a newly developed alignment-free tool, Coding-Non-Coding Index (CNCI), and 244 times faster than the most popular alignment-based tool, Coding Potential Calculator (CPC), when each is run single-threaded.

Conclusions: PLEK is an efficient alignment-free computational tool for distinguishing lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes. PLEK is especially suitable for PacBio or 454 sequencing data and for large-scale transcriptome data.
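The abstract describes an alignment-free approach in which k-mer usage is computed directly from each transcript and passed to an SVM. The following is a minimal Python sketch of plain k-mer frequency profiling, not PLEK's actual implementation: the function name `kmer_profile`, the `k_max` parameter, and the choice to skip windows containing ambiguous bases are assumptions made for illustration. The abstract does not say how the "improved" k-mer scheme differs from plain counting, so no such adjustment is attempted here.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_profile(seq, k_max=3):
    """Return relative k-mer frequencies for k = 1..k_max as one flat list.

    Hypothetical helper for illustration only; it implements plain
    sliding-window k-mer counting, not PLEK's improved scheme.
    """
    # Treat RNA and DNA spellings alike.
    seq = seq.upper().replace("U", "T")
    features = []
    for k in range(1, k_max + 1):
        # Number of sliding windows of length k (step size 1).
        windows = max(len(seq) - k + 1, 0)
        counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
        for i in range(windows):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip windows with ambiguous bases such as N
                counts[kmer] += 1
        # Fixed lexicographic order so every transcript yields the same layout.
        for kmer in sorted(counts):
            features.append(counts[kmer] / windows if windows else 0.0)
    return features


profile = kmer_profile("ACGUACGU", k_max=2)
print(len(profile))  # 4 one-mers + 16 two-mers
```

Because the profile depends only on the transcript sequence itself, no genome alignment or annotation is needed, which is the property the abstract emphasizes. With `k_max=5` the vector has 4 + 16 + 64 + 256 + 1024 = 1364 entries; fixed-length vectors like this could then be given to an SVM library such as LIBSVM, which appears in the related-paper list.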
Pages: 10
Related papers
58 in total
[1]   De novo assembly and validation of planaria transcriptome by massive parallel sequencing and shotgun proteomics [J].
Adamidi, Catherine ;
Wang, Yongbo ;
Gruen, Dominic ;
Mastrobuoni, Guido ;
You, Xintian ;
Tolle, Dominic ;
Dodt, Matthias ;
Mackowiak, Sebastian D. ;
Gogol-Doering, Andreas ;
Oenal, Pinar ;
Rybak, Agnieszka ;
Ross, Eric ;
Alvarado, Alejandro Sanchez ;
Kempa, Stefan ;
Dieterich, Christoph ;
Rajewsky, Nikolaus ;
Chen, Wei .
GENOME RESEARCH, 2011, 21 (07) :1193-1200
[2]   Long Noncoding RNAs: Cellular Address Codes in Development and Disease [J].
Batista, Pedro J. ;
Chang, Howard Y. .
CELL, 2013, 152 (06) :1298-1307
[3]   Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses [J].
Cabili, Moran N. ;
Trapnell, Cole ;
Goff, Loyal ;
Koziol, Magdalena ;
Tazon-Vega, Barbara ;
Regev, Aviv ;
Rinn, John L. .
GENES & DEVELOPMENT, 2011, 25 (18) :1915-1927
[4]   LIBSVM: A Library for Support Vector Machines [J].
Chang, Chih-Chung ;
Lin, Chih-Jen .
ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2011, 2 (03)
[5]   DNA sequence quality trimming and vector removal [J].
Chou, HH ;
Holmes, MH .
BIOINFORMATICS, 2001, 17 (12) :1093-1104
[6]   The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression [J].
Derrien, Thomas ;
Johnson, Rory ;
Bussotti, Giovanni ;
Tanzer, Andrea ;
Djebali, Sarah ;
Tilgner, Hagen ;
Guernec, Gregory ;
Martin, David ;
Merkel, Angelika ;
Knowles, David G. ;
Lagarde, Julien ;
Veeravalli, Lavanya ;
Ruan, Xiaoan ;
Ruan, Yijun ;
Lassmann, Timo ;
Carninci, Piero ;
Brown, James B. ;
Lipovich, Leonard ;
Gonzalez, Jose M. ;
Thomas, Mark ;
Davis, Carrie A. ;
Shiekhattar, Ramin ;
Gingeras, Thomas R. ;
Hubbard, Tim J. ;
Notredame, Cedric ;
Harrow, Jennifer ;
Guigo, Roderic .
GENOME RESEARCH, 2012, 22 (09) :1775-1789
[7]   miRFam: an effective automatic miRNA classification method based on n-grams and a multiclass SVM [J].
Ding, Jiandong ;
Zhou, Shuigeng ;
Guan, Jihong .
BMC BIOINFORMATICS, 2011, 12
[8]   Reassessing the Determinants of Breeding Synchrony in Ungulates [J].
English, Annie K. ;
Chauvenet, Alienor L. M. ;
Safi, Kamran ;
Pettorelli, Nathalie .
PLOS ONE, 2012, 7 (07)
[9]   ASSESSMENT OF PROTEIN CODING MEASURES [J].
FICKETT, JW ;
TUNG, CS .
NUCLEIC ACIDS RESEARCH, 1992, 20 (24) :6441-6450
[10]   Ensembl 2013 [J].
Flicek, Paul ;
Ahmed, Ikhlak ;
Amode, M. Ridwan ;
Barrell, Daniel ;
Beal, Kathryn ;
Brent, Simon ;
Carvalho-Silva, Denise ;
Clapham, Peter ;
Coates, Guy ;
Fairley, Susan ;
Fitzgerald, Stephen ;
Gil, Laurent ;
Garcia-Giron, Carlos ;
Gordon, Leo ;
Hourlier, Thibaut ;
Hunt, Sarah ;
Juettemann, Thomas ;
Kaehaeri, Andreas K. ;
Keenan, Stephen ;
Komorowska, Monika ;
Kulesha, Eugene ;
Longden, Ian ;
Maurel, Thomas ;
McLaren, William M. ;
Muffato, Matthieu ;
Nag, Rishi ;
Overduin, Bert ;
Pignatelli, Miguel ;
Pritchard, Bethan ;
Pritchard, Emily ;
Riat, Harpreet Singh ;
Ritchie, Graham R. S. ;
Ruffier, Magali ;
Schuster, Michael ;
Sheppard, Daniel ;
Sobral, Daniel ;
Taylor, Kieron ;
Thormann, Anja ;
Trevanion, Stephen ;
White, Simon ;
Wilder, Steven P. ;
Aken, Bronwen L. ;
Birney, Ewan ;
Cunningham, Fiona ;
Dunham, Ian ;
Harrow, Jennifer ;
Herrero, Javier ;
Hubbard, Tim J. P. ;
Johnson, Nathan ;
Kinsella, Rhoda .
NUCLEIC ACIDS RESEARCH, 2013, 41 (D1) :D48-D55