Gene prediction in metagenomic fragments based on the SVM algorithm

Cited by: 71
Authors
Liu, Yongchu [1, 2, 3]
Guo, Jiangtao [1, 2, 3]
Hu, Gangqing [1, 2, 3, 5]
Zhu, Huaiqiu [1, 2, 3, 4]
Affiliations
[1] Peking Univ, State Key Lab Turbulence & Complex Syst, Beijing 100871, Peoples R China
[2] Peking Univ, Dept Biomed Engn, Coll Engn, Beijing 100871, Peoples R China
[3] Peking Univ, Ctr Theoret Biol, Beijing 100871, Peoples R China
[4] Peking Univ, Ctr Prot Sci, Beijing 100871, Peoples R China
[5] NHLBI, Lab Mol Immunol, NIH, Bethesda, MD 20892 USA
Source
BMC BIOINFORMATICS | 2013 / Vol. 14
Funding
National Natural Science Foundation of China;
Keywords
TRANSLATION INITIATION SITE; SUPPORT-VECTOR-MACHINE; MICROBIAL GENOMES; IDENTIFICATION; BACTERIAL; ANNOTATION; SEQUENCES; RECOGNITION; ARCHAEAL; TOOL;
DOI
10.1186/1471-2105-14-S5-S12
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Metagenomic sequencing is becoming a powerful technology for exploring microorganisms from various environments, such as the human body, without isolation and cultivation. Accurately identifying genes from metagenomic fragments is one of the most fundamental issues. Results: In this article, we present a novel gene prediction method named MetaGUN for metagenomic fragments based on a machine learning approach of SVM. It implements a three-stage strategy to predict genes. First, it classifies input fragments into phylogenetic groups by a k-mer based sequence binning method. Then, protein-coding sequences are identified for each group independently with SVM classifiers that integrate entropy density profiles (EDP) of codon usage, translation initiation site (TIS) scores and open reading frame (ORF) length as input patterns. Finally, the TISs are adjusted by employing a modified version of MetaTISA. To identify protein-coding sequences, MetaGUN builds the universal module and the novel module. The former is based on a set of representative species, while the latter is designed to find potential functional DNA sequences with conserved domains. Conclusions: Comparisons on artificial shotgun fragments with multiple current metagenomic gene finders show that MetaGUN predicts better results on both 3' and 5' ends of genes with fragments of various lengths. In particular, it makes the most reliable predictions among these methods. As an application, MetaGUN was used to predict genes for two samples of the human gut microbiome. It identifies thousands of additional genes with significant evidence. Further analysis indicates that MetaGUN tends to predict more potential novel genes than other current metagenomic gene finders.
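The abstract names entropy density profiles (EDP) of codon usage and ORF length as SVM input patterns but gives no formula. The sketch below is only an illustration under stated assumptions: it takes the EDP to be each codon's share of the Shannon entropy of the codon frequency distribution, s_i = -(1/H) p_i log p_i, computed over all 64 codons. The function names (`codon_counts`, `entropy_density_profile`, `orf_features`) are hypothetical, TIS scores are omitted, and none of this is taken from the MetaGUN implementation.

```python
import math
from collections import Counter
from itertools import product

# Assumption: all 64 codons are used as the alphabet (the abstract does not say).
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_counts(orf):
    """Count in-frame codons (frame 0), skipping codons with non-ACGT characters."""
    orf = orf.upper()
    counts = Counter()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = orf[i:i + 3]
        if set(codon) <= set("ACGT"):
            counts[codon] += 1
    return counts


def entropy_density_profile(orf):
    """Return a 64-dimensional vector s_i = -(1/H) * p_i * log(p_i).

    Assumed definition: p_i is the frequency of codon i in the ORF and H is
    the Shannon entropy of that distribution. Returns all zeros when the
    entropy is undefined or zero (empty ORF, or a single codon type).
    """
    counts = codon_counts(orf)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CODONS)
    terms = []
    for codon in CODONS:
        p = counts[codon] / total
        terms.append(-p * math.log(p) if p > 0 else 0.0)
    entropy = sum(terms)
    if entropy == 0:
        return [0.0] * len(CODONS)
    return [t / entropy for t in terms]


def orf_features(orf):
    """Hypothetical feature vector: EDP of codon usage plus ORF length."""
    return entropy_density_profile(orf) + [float(len(orf))]
```

A vector built this way could be passed to any SVM library as one training or test instance. Whether and how the features are scaled, and how the TIS score is appended, is not stated in the abstract and is left out here.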
Pages: 12