Genome-wide nucleotide-level mammalian ancestor reconstruction

Cited by: 130
Authors
Paten, Benedict [1 ]
Herrero, Javier [2 ]
Fitzgerald, Stephen [2 ]
Beal, Kathryn [2 ]
Flicek, Paul [2 ]
Holmes, Ian [3 ]
Birney, Ewan [2 ]
Affiliations
[1] Univ Calif Santa Cruz, Ctr Biomol Sci & Engn, Santa Cruz, CA 95064 USA
[2] EMBL European Bioinformat Inst, Cambridge CB10 1SD, England
[3] Univ Calif Berkeley, Dept Bioengn, Berkeley, CA 94720 USA
Keywords
DOI
10.1101/gr.076521.108
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Recently, attention has turned to the problem of reconstructing complete ancestral sequences from large multiple alignments. Successful generation of these genome-wide reconstructions will facilitate a greater knowledge of the events that have driven evolution. We present a new evolutionary alignment modeler, called "Ortheus," for inferring the evolutionary history of a multiple alignment, in terms of both substitutions and, importantly, insertions and deletions. Based on a multiple sequence probabilistic transducer model of the type proposed by Holmes, Ortheus uses efficient stochastic graph-based dynamic programming methods. Unlike other methods, Ortheus does not rely on a single fixed alignment from which to work. Ortheus is also more scalable than previous methods while being fast, stable, and open source. Large-scale simulations show that Ortheus performs close to optimally on a deep mammalian phylogeny. Simulations also indicate that a significant proportion of the errors due to insertions and deletions can be avoided by not assuming a fixed alignment. We additionally use a challenging hold-out cross-validation procedure to test the method; using the reconstructions to predict extant sequence bases, we demonstrate significant improvements over using closest extant neighbor sequences. Accompanying this paper, a new, public, and genome-wide set of Ortheus ancestor alignments provides an intriguing new resource for evolutionary studies in mammals. As a first piece of analysis, we attempt to recover "fossilized" ancestral pseudogenes. We confidently find 31 cases in which the ancestral sequence had a more complete sequence than any of the extant sequences.
Pages: 1829-1843
Page count: 15
Related papers
55 in total
[11]   CONSTRAINED SEQUENCE ALIGNMENT [J].
CHAO, KM ;
HARDISON, RC ;
MILLER, W .
BULLETIN OF MATHEMATICAL BIOLOGY, 1993, 55 (03) :503-524
[12]  
Chindelevitch Leonid, 2006, Journal of Bioinformatics and Computational Biology, V4, P721, DOI 10.1142/S0219720006002168
[13]   Distribution and intensity of constraint in mammalian genomic sequence [J].
Cooper, GM ;
Stone, EA ;
Asimenos, G ;
Green, ED ;
Batzoglou, S ;
Sidow, A .
GENOME RESEARCH, 2005, 15 (07) :901-913
[14]   Characterization of evolutionary rates and constraints in three mammalian genomes [J].
Cooper, GM ;
Brudno, M ;
Stone, EA ;
Dubchak, I ;
Batzoglou, S ;
Sidow, A .
GENOME RESEARCH, 2004, 14 (04) :539-548
[15]   Mauve: Multiple alignment of conserved genomic sequence with rearrangements [J].
Darling, ACE ;
Mau, B ;
Blattner, FR ;
Perna, NT .
GENOME RESEARCH, 2004, 14 (07) :1394-1403
[16]   Exact and heuristic algorithms for the Indel Maximum Likelihood Problem [J].
Diallo, Abdoulaye Banire ;
Makarenkov, Vladimir ;
Blanchette, Mathieu .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2007, 14 (04) :446-461
[17]   ProbCons: Probabilistic consistency-based multiple sequence alignment [J].
Do, CB ;
Mahabhashyam, MSP ;
Brudno, M ;
Batzoglou, S .
GENOME RESEARCH, 2005, 15 (02) :330-340
[18]  
Durbin R., 1998, BIOL SEQUENCE ANAL
[19]   MUSCLE: a multiple sequence alignment method with reduced time and space complexity [J].
Edgar, RC .
BMC BIOINFORMATICS, 2004, 5 (1) :1-19
[20]  
Felsenstein Joseph, 2004, Inferring Phylogenies, V2