Accounting for Solvent Accessibility and Secondary Structure in Protein Phylogenetics Is Clearly Beneficial

Cited by: 73
Authors
Le, Si Quang [1, 2]
Gascuel, Olivier [1]
Affiliations
[1] Univ Montpellier 2, CNRS, LIRMM, F-34392 Montpellier 5, France
[2] Wellcome Trust Sanger Inst, Cambridge CB10 1SA, England
Keywords
Amino-acid substitutions; maximum likelihood; partition models; replacement rate matrices; structural annotation of proteins; topological impact; SEQUENCE; MODEL; ALGORITHM; TREE; INFERENCE; SELECTION; EVOLUTION; RATES;
DOI
10.1093/sysbio/syq002
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Amino acid substitution models are essential to most methods for inferring phylogenies from protein data. These models represent the ways in which proteins evolve and substitutions accumulate over time. It is widely accepted that the substitution processes vary depending on the structural configuration of the protein residues. However, this information is very rarely used in phylogenetic studies, though the 3-dimensional structure of tens of thousands of proteins has been elucidated. Here, we reinvestigate the question in order to fill this gap. We use an improved estimation methodology and a very large database comprising 1471 nonredundant globular protein alignments with structural annotations to estimate new amino acid substitution models accounting for the secondary structure and solvent accessibility of the residues. These models incorporate a confidence coefficient that is estimated from the data and reflects the reliability and usefulness of structural annotations in the analyzed sequences. Our results with 300 independent test alignments show an impressive likelihood gain compared with standard models such as JTT or WAG. Moreover, the use of these models induces significant topological changes in the inferred trees, which should be of primary interest to phylogeneticists. Our data, models, and software are available for download from http://atgc.lirmm.fr/phyml-structure/.
Pages: 277-287
Page count: 11
Related papers
42 in total
[1]   NEW LOOK AT STATISTICAL-MODEL IDENTIFICATION [J].
AKAIKE, H .
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 1974, AC19 (06) :716-723
[2]  
Bateman A, 2004, NUCLEIC ACIDS RES, V32, pD138, DOI [10.1093/nar/gkp985, 10.1093/nar/gkh121, 10.1093/nar/gkr1065]
[3]   The Protein Data Bank [J].
Berman, HM ;
Westbrook, J ;
Feng, Z ;
Gilliland, G ;
Bhat, TN ;
Weissig, H ;
Shindyalov, IN ;
Bourne, PE .
NUCLEIC ACIDS RESEARCH, 2000, 28 (01) :235-242
[4]   On the interpretation of bootstrap trees: Appropriate threshold of clade selection and induced gain [J].
Berry, V ;
Gascuel, O .
MOLECULAR BIOLOGY AND EVOLUTION, 1996, 13 (07) :999-1011
[5]  
Branden C., 1999, INTRO PROTEIN STRUCT
[6]  
Bryant D, 2005, MATHEMATICS OF EVOLUTION AND PHYLOGENY, P33
[7]   THE RELATION BETWEEN THE DIVERGENCE OF SEQUENCE AND STRUCTURE IN PROTEINS [J].
CHOTHIA, C ;
LESK, AM .
EMBO JOURNAL, 1986, 5 (04) :823-826
[8]  
Dayhoff M O., 1978, Atlas of Protein Seq Struct, p. 345
[9]  
Felsenstein J., 2003, Inferring phylogenies
[10]   BIONJ: An improved version of the NJ algorithm based on a simple model of sequence data [J].
Gascuel, O .
MOLECULAR BIOLOGY AND EVOLUTION, 1997, 14 (07) :685-695