General continuous-time Markov model of sequence evolution via insertions/deletions: are alignment probabilities factorable?

Cited by: 5
Authors
Ezawa, Kiyoshi [1, 2]
Affiliations
[1] Kyushu Inst Technol, Dept Biosci & Bioinformat, Iizuka, Fukuoka 8208502, Japan
[2] Univ Houston, Dept Biol & Biochem, Houston, TX 77204 USA
Keywords
Stochastic evolutionary model; Insertion/deletion (indel); Sequence alignment probability; Factorability; Biological realism; Power-law distribution; Rate variation; Non-equilibrium evolution; MAXIMUM-LIKELIHOOD ALIGNMENT; JOINT BAYESIAN-ESTIMATION; FRAMEWORK; UNCERTAINTY; DIVERGENCE; CHIMPANZEE; PHYLOGENY; INFERENCE; DELETIONS; SAMPLES
DOI
10.1186/s12859-016-1105-7
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, so it is imperative to develop a stochastic evolutionary model that lets us reliably calculate the probability of sequence evolution through indel processes. Current probabilistic indel models are mostly based on either hidden Markov models (HMMs) or transducer theories, both of which give the indel component of the probability of a given sequence alignment as a product of either column-to-column transition probabilities or block-wise contributions along the alignment. However, it is not a priori clear how these models are related to any genuine stochastic evolutionary model, which describes the stochastic evolution of an entire sequence along the time axis. Moreover, none of these models can currently fully accommodate biologically realistic features, such as overlapping indels, power-law indel-length distributions, and indel rate variation across regions.
Results: Here, we theoretically dissect the ab initio calculation of the probability of a given sequence alignment under a genuine stochastic evolutionary model, more specifically, a general continuous-time Markov model of the evolution of an entire sequence via insertions and deletions. Our model is a simple extension of the general "substitution/insertion/deletion (SID) model". Using the operator representation of indels and the technique of time-dependent perturbation theory, we express the ab initio probability as a summation over all alignment-consistent indel histories. Exploiting the equivalence relations between different indel histories, we find a "sufficient and nearly necessary" set of conditions under which the probability can be factorized into the product of an overall factor and the contributions from regions separated by gapless columns of the alignment, thus providing a sort of generalized HMM. The conditions distinguish evolutionary models with factorable alignment probabilities from those without. The former category includes the "long indel" model (a space-homogeneous SID model) and the model used by Dawg, a genuine sequence evolution simulator.
Conclusions: With intuitive clarity and mathematical precision, our theoretical formulation will help further advance the ab initio calculation of alignment probabilities under biologically realistic models of sequence evolution via indels.
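The factorization described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: it only shows how a pairwise alignment splits into gapped regions separated by gapless columns, and how a factorized probability would then be assembled as an overall factor times one contribution per region. The function names, the `-` gap symbol, and the numeric factors are all hypothetical; computing the actual per-region contributions requires the indel model itself.

```python
from typing import List, Sequence, Tuple

GAP = "-"  # assumed gap symbol; not specified in the record


def split_by_gapless_columns(anc: str, desc: str) -> List[Tuple[str, str]]:
    """Return the maximal runs of gapped columns in a pairwise alignment.

    Columns in which both rows hold a residue (gapless columns) act as
    separators; each returned pair is one gapped region between them.
    """
    if len(anc) != len(desc):
        raise ValueError("alignment rows must have equal length")
    regions: List[Tuple[str, str]] = []
    start = None  # index where the current gapped region began
    for i, (a, d) in enumerate(zip(anc, desc)):
        gapped = a == GAP or d == GAP
        if gapped and start is None:
            start = i
        elif not gapped and start is not None:
            regions.append((anc[start:i], desc[start:i]))
            start = None
    if start is not None:  # alignment ends inside a gapped region
        regions.append((anc[start:], desc[start:]))
    return regions


def factorized_probability(overall: float, region_factors: Sequence[float]) -> float:
    """Multiply an overall factor by per-region contributions.

    The inputs are placeholders supplied by the caller; this function only
    shows the product form, not how the factors are derived.
    """
    prob = overall
    for factor in region_factors:
        prob *= factor
    return prob


regions = split_by_gapless_columns("AC--GTA-C", "A-TTGT-GC")
# Two gapped regions, each bounded by gapless columns:
# [("C--", "-TT"), ("A-", "-G")]
```

Each region would be scored independently only under models that satisfy the factorability conditions the abstract refers to; for other models the product form above does not apply.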
Pages: 25