The average common substring approach to phylogenomic reconstruction

Cited by: 147

Authors:

Ulitsky, I [1]

Burstein, D [1]

Tuller, T [1]

Chor, B [1]

Affiliations:

[1] Tel Aviv Univ, Sch Comp Sci, IL-69978 Tel Aviv, Israel

Source:

JOURNAL OF COMPUTATIONAL BIOLOGY | 2006 / Vol. 13 / No. 02

Keywords:

phylogenomics; whole genome and proteome phylogeny; compressibility; distance matrix;

DOI:

10.1089/cmb.2006.13.336

Chinese Library Classification:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

We describe a novel method for efficient reconstruction of phylogenetic trees, based on sequences of whole genomes or proteomes, whose lengths may greatly vary. The core of our method is a new measure of pairwise distances between sequences. This measure is based on computing the average lengths of maximum common substrings, which is intrinsically related to information theoretic tools (Kullback-Leibler relative entropy). We present an algorithm for efficiently computing these distances. In principle, the distance between two sequences of length l can be calculated in O(l) time. We implemented the algorithm using suffix arrays. Our implementation is fast enough to enable the construction of the proteome phylogenomic tree for hundreds of species and the genome phylogenomic forest for almost two thousand viruses. An initial analysis of the results exhibits a remarkable agreement with "acceptable phylogenetic and taxonomic truth." To assess our approach, our results were compared to the traditional (single-gene or protein-based) maximum likelihood method. The obtained trees were compared to implementations of a number of alternative approaches, including two that were previously published in the literature, and to the published results of a third approach. Comparing their outcome and running time to ours, using "traditional" trees and a standard tree comparison method, our algorithm improved upon the "competition" by a substantial margin. The simplicity and speed of our method allow for a whole genome analysis with the greatest scope attempted so far. We describe here five different applications of the method, which not only show the validity of the method, but also suggest a number of novel phylogenetic insights.
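To make the distance measure in the abstract concrete, the following is a minimal sketch of an average-common-substring distance. The abstract does not state the formula, so the normalization used here (log of the other sequence's length divided by the average match length, minus a self-comparison correction term, then symmetrized) is an assumption. The function names are hypothetical, and the substring search is deliberately naive: it is far slower than the O(l) bound and the suffix-array implementation the abstract mentions. Inputs are assumed to be non-empty strings.

```python
import math


def avg_common_substring_length(a: str, b: str) -> float:
    """Average, over start positions i in a, of the length of the
    longest substring of a starting at i that occurs anywhere in b."""
    total = 0
    for i in range(len(a)):
        k = 0
        # Extend the match while a[i:i+k+1] still occurs somewhere in b.
        while i + k < len(a) and a[i:i + k + 1] in b:
            k += 1
        total += k
    return total / len(a)


def acs_distance(a: str, b: str) -> float:
    """Symmetrized distance built from average common substring lengths.

    The normalization below is an assumed form, not taken from the abstract.
    """
    def directed(x: str, y: str) -> float:
        l_xy = avg_common_substring_length(x, y)
        if l_xy == 0:
            # No shared symbols at all: treat the sequences as infinitely far apart.
            return math.inf
        l_xx = avg_common_substring_length(x, x)
        # Long shared substrings give a large l_xy and hence a small distance;
        # the second term removes the offset left when x is compared to itself.
        return math.log(len(y)) / l_xy - math.log(len(x)) / l_xx

    return (directed(a, b) + directed(b, a)) / 2
```

Pairwise values from a function like `acs_distance` would populate the distance matrix from which a tree is then built. Under this assumed normalization, identical sequences are at distance zero and the measure is symmetric by construction.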

Pages: 336-350

Page count: 15
