kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison

被引:89
作者
Leimeister, Chris-Andre [1 ]
Morgenstern, Burkhard [1 ,2 ]
机构
[1] Univ Gottingen, Inst Microbiol & Genet, Dept Bioinformat, D-37073 Gottingen, Germany
[2] Univ Evry Val Essonne, USC INRA, UMR CNRS 8071, Lab Stat & Genom, F-91037 Evry, France
关键词
IMPROVEMENTS; DISTANCES;
D O I
10.1093/bioinformatics/btu331
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best known alignment-free methods is the average common substring approach that defines a distance measure on sequences based on the average length of longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays. Results: To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially on protein sequences, our method seems to be superior. On simulated protein families, kmacs even outperformed a classical approach to phylogeny reconstruction using multiple alignment and maximum likelihood.
引用
收藏
页码:2000 / 2008
页数:9
相关论文
共 34 条
[1]  
Abouelhoda M. I., 2004, Journal of Discrete Algorithms, V2, P53, DOI 10.1016/S1570-8667(03)00065-0
[2]  
[Anonymous], 2005, PHYLIP (phylogeny inference package) version 3.6
[3]  
Babenko MA, 2008, LECT NOTES COMPUT SC, V5010, P64, DOI 10.1007/978-3-540-79709-8_10
[4]  
Boden M., 2013, OPENACCESS SERIES IN, P21
[5]   Alignment-free phylogeny of whole genomes using underlying subwords [J].
Comin, Matteo ;
Verzotto, Davide .
ALGORITHMS FOR MOLECULAR BIOLOGY, 2012, 7
[6]   Variable length local decoding and alignment-free sequence comparison [J].
Didier, Gilles ;
Corel, Eduardo ;
Laprevotte, Ivan ;
Grossmann, Alex ;
Landes-Devauchelle, Claudine .
THEORETICAL COMPUTER SCIENCE, 2012, 462 :1-11
[7]   Efficient estimation of pairwise distances between genomes [J].
Domazet-Loso, Mirjana ;
Haubold, Bernhard .
BIOINFORMATICS, 2009, 25 (24) :3221-3227
[8]   EVOLUTIONARY TREES FROM DNA-SEQUENCES - A MAXIMUM-LIKELIHOOD APPROACH [J].
FELSENSTEIN, J .
JOURNAL OF MOLECULAR EVOLUTION, 1981, 17 (06) :368-376
[9]  
Fischer J, 2007, LECT NOTES COMPUT SC, V4614, P459
[10]  
Fischer J, 2006, LECT NOTES COMPUT SC, V4009, P36