A Penalized-Likelihood Method to Estimate the Distribution of Selection Coefficients from Phylogenetic Data

Cited by: 37
Authors
Tamuri, Asif U. [1 ]
Goldman, Nick [1 ]
dos Reis, Mario [2 ]
Affiliations
[1] European Bioinformatics Institute, European Molecular Biology Laboratory, Cambridge CB10 1SD, England
[2] Department of Genetics, Evolution and Environment, University College London (UCL), London WC1E 6BT, England
Funding
Biotechnology and Biological Sciences Research Council (UK)
Keywords
fitness effects; selection coefficient; penalized likelihood; mitochondria; chloroplast; influenza; MAXIMUM-LIKELIHOOD; MOLECULAR EVOLUTION; POPULATION-GENETICS; MITOCHONDRIAL-DNA; PROTEIN EVOLUTION; DIVERGENCE TIMES; MUTATION; MODELS; SUBSTITUTION; RESIDUES;
DOI
10.1534/genetics.114.162263
Chinese Library Classification
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
We develop a maximum penalized-likelihood (MPL) method to estimate the fitnesses of amino acids and the distribution of selection coefficients (S = 2Ns) in protein-coding genes from phylogenetic data. This improves on a previous maximum-likelihood method. Various penalty functions are used to penalize extreme estimates of the fitnesses, thus correcting overfitting by the previous method. Using a combination of computer simulation and real data analysis, we evaluate the effect of the various penalties on the estimation of the fitnesses and the distribution of S. We show the new method regularizes the estimates of the fitnesses for small, relatively uninformative data sets, but it can still recover the large proportion of deleterious mutations when present in simulated data. Computer simulations indicate that as the number of taxa in the phylogeny or the level of sequence divergence increases, the distribution of S can be more accurately estimated. Furthermore, the strength of the penalty can be varied to study how informative a particular data set is about the distribution of S. We analyze three protein-coding genes (the chloroplast rubisco protein, mammal mitochondrial proteins, and an influenza virus polymerase) and show the new method recovers a large proportion of deleterious mutations in these data, even under strong penalties, confirming the distribution of S is bimodal in these real data. We recommend the use of the new MPL approach for the estimation of the distribution of S in species phylogenies of protein-coding genes.
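The abstract does not spell out the model, so the following is only a toy sketch of the general idea of penalized likelihood, not the authors' implementation. It assumes (hypothetically) that the frequencies of the residues at one site are a softmax of their fitnesses, that the data are simple residue counts, and that the penalty is a quadratic term with strength `lam`; the names `softmax`, `fit_fitnesses`, and `lam` are illustrative. It shows how a stronger penalty pulls the fitnesses of unobserved residues away from extreme values.

```python
import math


def softmax(fitnesses):
    """Map fitnesses to frequencies (assumed toy link, not the paper's model)."""
    m = max(fitnesses)
    exps = [math.exp(f - m) for f in fitnesses]
    total = sum(exps)
    return [e / total for e in exps]


def fit_fitnesses(counts, lam, steps=5000, lr=0.05):
    """Maximize sum(n_i * log pi_i) - (lam / 2) * sum(F_i ** 2) by gradient ascent."""
    n_total = sum(counts)
    fitnesses = [0.0] * len(counts)
    for _ in range(steps):
        pi = softmax(fitnesses)
        # Gradient of the penalized log-likelihood with respect to each F_i.
        grad = [n - n_total * p - lam * f
                for n, p, f in zip(counts, pi, fitnesses)]
        fitnesses = [f + lr * g for f, g in zip(fitnesses, grad)]
    return fitnesses


# Hypothetical site: two residues observed, two never observed.
counts = [8, 2, 0, 0]
weak = fit_fitnesses(counts, lam=0.01)   # close to unpenalized ML
strong = fit_fitnesses(counts, lam=1.0)  # strongly regularized

print("weak penalty:  ", [round(f, 2) for f in weak])
print("strong penalty:", [round(f, 2) for f in strong])
```

Without any penalty, the fitnesses of the two unobserved residues would drift toward negative infinity; varying `lam` in a sketch like this mirrors, in a very simplified way, the abstract's point that the penalty strength can be used to probe how informative a data set is.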
Pages: 257-271
Page count: 15