kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison

被引:89
作者
Leimeister, Chris-Andre [1 ]
Morgenstern, Burkhard [1 ,2 ]
机构
[1] Univ Gottingen, Inst Microbiol & Genet, Dept Bioinformat, D-37073 Gottingen, Germany
[2] Univ Evry Val Essonne, USC INRA, UMR CNRS 8071, Lab Stat & Genom, F-91037 Evry, France
关键词
IMPROVEMENTS; DISTANCES;
D O I
10.1093/bioinformatics/btu331
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best known alignment-free methods is the average common substring approach that defines a distance measure on sequences based on the average length of longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays. Results: To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially on protein sequences, our method seems to be superior. On simulated protein families, kmacs even outperformed a classical approach to phylogeny reconstruction using multiple alignment and maximum likelihood.
引用
收藏
页码:2000 / 2008
页数:9
相关论文
共 34 条
[21]  
Manber Udi., 1990, SODA 90, P319
[22]   Genome characteristics of a generalist marine bacterial lineage [J].
Newton, Ryan J. ;
Griffin, Laura E. ;
Bowles, Kathy M. ;
Meile, Christof ;
Gifford, Scott ;
Givens, Carrie E. ;
Howard, Erinn C. ;
King, Eric ;
Oakley, Clinton A. ;
Reisch, Chris R. ;
Rinta-Kanto, Johanna M. ;
Sharma, Shalabh ;
Sun, Shulei ;
Varaljay, Vanessa ;
Vila-Costa, Maria ;
Westrich, Jason R. ;
Moran, Mary Ann .
ISME JOURNAL, 2010, 4 (06) :784-798
[23]   CVTree: a phylogenetic tree reconstruction tool based on whole genomes [J].
Qi, J ;
Luo, H ;
Hao, BL .
NUCLEIC ACIDS RESEARCH, 2004, 32 :W45-W47
[24]   COMPARISON OF PHYLOGENETIC TREES [J].
ROBINSON, DF ;
FOULDS, LR .
MATHEMATICAL BIOSCIENCES, 1981, 53 (1-2) :131-147
[25]   THE NEIGHBOR-JOINING METHOD - A NEW METHOD FOR RECONSTRUCTING PHYLOGENETIC TREES [J].
SAITOU, N ;
NEI, M .
MOLECULAR BIOLOGY AND EVOLUTION, 1987, 4 (04) :406-425
[26]   Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega [J].
Sievers, Fabian ;
Wilm, Andreas ;
Dineen, David ;
Gibson, Toby J. ;
Karplus, Kevin ;
Li, Weizhong ;
Lopez, Rodrigo ;
McWilliam, Hamish ;
Remmert, Michael ;
Soeding, Johannes ;
Thompson, Julie D. ;
Higgins, Desmond G. .
MOLECULAR SYSTEMS BIOLOGY, 2011, 7
[27]   Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions [J].
Sims, Gregory E. ;
Jun, Se-Ran ;
Wu, Guohong A. ;
Kim, Sung-Hou .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2009, 106 (08) :2677-2682
[28]   Rose: generating sequence families [J].
Stoye, J ;
Evers, D ;
Meyer, F .
BIOINFORMATICS, 1998, 14 (02) :157-163
[29]   BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark [J].
Thompson, JD ;
Koehl, P ;
Ripp, R ;
Poch, O .
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2005, 61 (01) :127-136
[30]   CLUSTAL-W - IMPROVING THE SENSITIVITY OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT THROUGH SEQUENCE WEIGHTING, POSITION-SPECIFIC GAP PENALTIES AND WEIGHT MATRIX CHOICE [J].
THOMPSON, JD ;
HIGGINS, DG ;
GIBSON, TJ .
NUCLEIC ACIDS RESEARCH, 1994, 22 (22) :4673-4680