A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF

Cited by: 32
Authors
Cong, Yingnan [1, 2]
Chan, Yao-ban [3]
Ragan, Mark A. [1, 2]
Affiliations
[1] Univ Queensland, Inst Mol Biosci, Brisbane, Qld 4072, Australia
[2] Univ Queensland, ARC Ctr Excellence Bioinformat, Brisbane, Qld 4072, Australia
[3] Univ Melbourne, Sch Math & Stat, Melbourne, Vic 3010, Australia
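Source
SCIENTIFIC REPORTS | 2016 / Vol. 6
Keywords
STAPHYLOCOCCUS-AUREUS; GENOMES; DNA; EVOLUTION; SEQUENCE
DOI
10.1038/srep30308
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09
Abstract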
Lateral genetic transfer (LGT) plays an important role in the evolution of microbes. Existing computational methods for detecting genomic regions of putative lateral origin scale poorly to large data. Here, we propose a novel method based on TF-IDF (Term Frequency-Inverse Document Frequency) statistics to detect not only regions of lateral origin, but also their origin and direction of transfer, in sets of hierarchically structured nucleotide or protein sequences. This approach is based on the frequency distributions of k-mers in the sequences. If a set of contiguous k-mers appears sufficiently more frequently in another phyletic group than in its own, we infer that they have been transferred from the first group to the second. We performed rigorous tests of TF-IDF using simulated and empirical datasets. With the simulated data, we tested our method under different parameter settings for sequence length, substitution rate between and within groups and post-LGT, deletion rate, length of transferred region and k size, and found that we can detect LGT events with high precision and recall. Our method performs better than an established method, ALFY, which has high recall but low precision. Our method is efficient, with runtime increasing approximately linearly with sequence length.
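The core idea in the abstract (contiguous k-mers that are more frequent in another group than in the sequence's own group) can be illustrated with a toy sketch. This is not the authors' implementation: it replaces the full TF-IDF weighting with a simple per-group presence frequency, and the function names and the `min_run` and `margin` parameters are hypothetical, chosen only for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def group_frequency(groups, k):
    """For each group, the fraction of its sequences that contain each k-mer."""
    freq = {}
    for name, seqs in groups.items():
        counts = Counter()
        for s in seqs:
            counts.update(set(kmers(s, k)))  # count presence once per sequence
        freq[name] = {km: c / len(seqs) for km, c in counts.items()}
    return freq


def candidate_regions(query, own, groups, k, min_run=3, margin=0.5):
    """Flag runs of contiguous k-mers that favour another group.

    Assumed rule (a simplification, not the published statistic): a k-mer
    points to donor group g if its frequency in g exceeds its frequency in
    the query's own group by at least `margin`. Runs of at least `min_run`
    such k-mers with the same donor are reported as (start, end, donor),
    with `end` exclusive. `groups[own]` should not include the query itself.
    """
    freq = group_frequency(groups, k)
    others = [g for g in groups if g != own]
    regions = []
    run_start, run_donor = None, None
    # A trailing None acts as a sentinel so the last run is flushed.
    for i, km in enumerate(kmers(query, k) + [None]):
        donor = None
        if km is not None:
            own_f = freq[own].get(km, 0.0)
            best = max(others, key=lambda g: freq[g].get(km, 0.0), default=None)
            if best is not None and freq[best].get(km, 0.0) - own_f >= margin:
                donor = best
        if donor != run_donor:
            if run_donor is not None and i - run_start >= min_run:
                regions.append((run_start, i - 1 + k, run_donor))
            run_start, run_donor = i, donor
    return regions


# Toy data: a group-A query carrying a segment typical of group B.
groups = {
    "A": ["AAAAAAAAAAAA", "AAAAAAAAAAAA"],
    "B": ["CGTACGGTCA", "CGTACGGTCA"],
}
query = "AAAA" + "CGTACGGT" + "AAAA"
regions = candidate_regions(query, "A", groups, k=4)
```

On this toy input the flagged region spans `query[4:12]`, the inserted group-B segment, with group B as the inferred donor. Real data would need the statistical thresholds and hierarchical grouping described in the paper, which this sketch does not attempt to reproduce.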
Pages: 13