kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison

被引：89

作者：

Leimeister, Chris-Andre ^{[1
]}

Morgenstern, Burkhard ^{[1
,2
]}

机构：

[1] Univ Gottingen, Inst Microbiol & Genet, Dept Bioinformat, D-37073 Gottingen, Germany

[2] Univ Evry Val Essonne, USC INRA, UMR CNRS 8071, Lab Stat & Genom, F-91037 Evry, France

来源：

BIOINFORMATICS | 2014年 / 30卷 / 14期

关键词：

IMPROVEMENTS; DISTANCES;

D O I：

10.1093/bioinformatics/btu331

中图分类号：

Q5 [生物化学];

学科分类号：

071010 ; 081704 ;

摘要：

Motivation: Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best known alignment-free methods is the average common substring approach that defines a distance measure on sequences based on the average length of longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays. Results: To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially on protein sequences, our method seems to be superior. On simulated protein families, kmacs even outperformed a classical approach to phylogeny reconstruction using multiple alignment and maximum likelihood.

引用

页码：2000 / 2008

页数：9

共 34 条

[1]

Abouelhoda M. I., 2004, Journal of Discrete Algorithms, V2, P53, DOI 10.1016/S1570-8667(03)00065-0

[2]

[Anonymous], 2005, PHYLIP (phylogeny inference package) version 3.6

[3]

Babenko MA, 2008, LECT NOTES COMPUT SC, V5010, P64, DOI 10.1007/978-3-540-79709-8_10

[4]

Boden M., 2013, OPENACCESS SERIES IN, P21

[5] Alignment-free phylogeny of whole genomes using underlying subwords [J].

Comin, Matteo ;

Verzotto, Davide .

ALGORITHMS FOR MOLECULAR BIOLOGY, 2012, 7

[6] Variable length local decoding and alignment-free sequence comparison [J].

Didier, Gilles ;

Corel, Eduardo ;

Laprevotte, Ivan ;

Grossmann, Alex ;

Landes-Devauchelle, Claudine .

THEORETICAL COMPUTER SCIENCE, 2012, 462 :1-11

[7] Efficient estimation of pairwise distances between genomes [J].

Domazet-Loso, Mirjana ;

Haubold, Bernhard .

BIOINFORMATICS, 2009, 25 (24) :3221-3227

[8] EVOLUTIONARY TREES FROM DNA-SEQUENCES - A MAXIMUM-LIKELIHOOD APPROACH [J].

FELSENSTEIN, J .

JOURNAL OF MOLECULAR EVOLUTION, 1981, 17 (06) :368-376

[9]

Fischer J, 2007, LECT NOTES COMPUT SC, V4614, P459

[10]

Fischer J, 2006, LECT NOTES COMPUT SC, V4009, P36

← 1 2 3 4 →