kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison

被引：89

作者：

Leimeister, Chris-Andre ^{[1
]}

Morgenstern, Burkhard ^{[1
,2
]}

机构：

[1] Univ Gottingen, Inst Microbiol & Genet, Dept Bioinformat, D-37073 Gottingen, Germany

[2] Univ Evry Val Essonne, USC INRA, UMR CNRS 8071, Lab Stat & Genom, F-91037 Evry, France

来源：

BIOINFORMATICS | 2014年 / 30卷 / 14期

关键词：

IMPROVEMENTS; DISTANCES;

D O I：

10.1093/bioinformatics/btu331

中图分类号：

Q5 [生物化学];

学科分类号：

071010 ; 081704 ;

摘要：

Motivation: Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best known alignment-free methods is the average common substring approach that defines a distance measure on sequences based on the average length of longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays. Results: To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially on protein sequences, our method seems to be superior. On simulated protein families, kmacs even outperformed a classical approach to phylogeny reconstruction using multiple alignment and maximum likelihood.

引用

页码：2000 / 2008

页数：9

共 34 条

[21]

Manber Udi., 1990, SODA 90, P319

[22] Genome characteristics of a generalist marine bacterial lineage [J].

Newton, Ryan J. ;

Griffin, Laura E. ;

Bowles, Kathy M. ;

Meile, Christof ;

Gifford, Scott ;

Givens, Carrie E. ;

Howard, Erinn C. ;

King, Eric ;

Oakley, Clinton A. ;

Reisch, Chris R. ;

Rinta-Kanto, Johanna M. ;

Sharma, Shalabh ;

Sun, Shulei ;

Varaljay, Vanessa ;

Vila-Costa, Maria ;

Westrich, Jason R. ;

Moran, Mary Ann .

ISME JOURNAL, 2010, 4 (06) :784-798

[23] CVTree: a phylogenetic tree reconstruction tool based on whole genomes [J].

Qi, J ;

Luo, H ;

Hao, BL .

NUCLEIC ACIDS RESEARCH, 2004, 32 :W45-W47

[24] COMPARISON OF PHYLOGENETIC TREES [J].

ROBINSON, DF ;

FOULDS, LR .

MATHEMATICAL BIOSCIENCES, 1981, 53 (1-2) :131-147

[25] THE NEIGHBOR-JOINING METHOD - A NEW METHOD FOR RECONSTRUCTING PHYLOGENETIC TREES [J].

SAITOU, N ;

NEI, M .

MOLECULAR BIOLOGY AND EVOLUTION, 1987, 4 (04) :406-425

[26] Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega [J].

Sievers, Fabian ;

Wilm, Andreas ;

Dineen, David ;

Gibson, Toby J. ;

Karplus, Kevin ;

Li, Weizhong ;

Lopez, Rodrigo ;

McWilliam, Hamish ;

Remmert, Michael ;

Soeding, Johannes ;

Thompson, Julie D. ;

Higgins, Desmond G. .

MOLECULAR SYSTEMS BIOLOGY, 2011, 7

[27] Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions [J].

Sims, Gregory E. ;

Jun, Se-Ran ;

Wu, Guohong A. ;

Kim, Sung-Hou .

PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2009, 106 (08) :2677-2682

[28] Rose: generating sequence families [J].

Stoye, J ;

Evers, D ;

Meyer, F .

BIOINFORMATICS, 1998, 14 (02) :157-163

[29] BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark [J].

Thompson, JD ;

Koehl, P ;

Ripp, R ;

Poch, O .

PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2005, 61 (01) :127-136

[30] CLUSTAL-W - IMPROVING THE SENSITIVITY OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT THROUGH SEQUENCE WEIGHTING, POSITION-SPECIFIC GAP PENALTIES AND WEIGHT MATRIX CHOICE [J].

THOMPSON, JD ;

HIGGINS, DG ;

GIBSON, TJ .

NUCLEIC ACIDS RESEARCH, 1994, 22 (22) :4673-4680

← 1 2 3 4 →