Genome-wide nucleotide-level mammalian ancestor reconstruction

被引:130
作者
Paten, Benedict [1 ]
Herrero, Javier [2 ]
Fitzgerald, Stephen [2 ]
Beal, Kathryn [2 ]
Flicek, Paul [2 ]
Holmes, Ian [3 ]
Birney, Ewan [2 ]
机构
[1] Univ Calif Santa Cruz, Ctr Biomol Sci & Engn, Santa Cruz, CA 95064 USA
[2] EMBL European Bioinformat Inst, Cambridge CB10 1SD, England
[3] Univ Calif Berkeley, Dept Bioengn, Berkeley, CA 94720 USA
关键词
D O I
10.1101/gr.076521.108
中图分类号
Q5 [生物化学]; Q7 [分子生物学];
学科分类号
071010 ; 081704 ;
摘要
Recently attention has been turned to the problem of reconstructing complete ancestral sequences from large multiple alignments. Successful generation of these genome-wide reconstructions will facilitate a greater knowledge of the events that have driven evolution. We present a new evolutionary alignment modeler, called "Ortheus," for inferring the evolutionary history of a multiple alignment, in terms of both substitutions and, importantly, insertions and deletions. Based on a multiple sequence probabilistic transducer model of the type proposed by Holmes, Ortheus uses efficient stochastic graph-based dynamic programming methods. Unlike other methods, Ortheus does not rely on a single fixed alignment from which to work. Ortheus is also more scaleable than previous methods while being fast, stable, and open source. Large-scale simulations show that Ortheus performs close to optimally on a deep mammalian phylogeny. Simulations also indicate that significant proportions of errors due to insertions and deletions can be avoided by not assuming a fixed alignment. We additionally use a challenging hold-out cross-validation procedure to test the method; using the reconstructions to predict extant sequence bases, we demonstrate significant improvements over using closest extant neighbor sequences. Accompanying this paper, a new, public, and genome-wide set of Ortheus ancestor alignments provide an intriguing new resource for evolutionary studies in mammals. As a first piece of analysis, we attempt to recover "fossilized" ancestral pseudogenes. We confidently find 31 cases in which the ancestral sequence had a more complete sequence than any of the extant sequences.
引用
收藏
页码:1829 / 1843
页数:15
相关论文
共 55 条
[1]   SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model [J].
Alexandersson, M ;
Cawley, S ;
Pachter, L .
GENOME RESEARCH, 2003, 13 (03) :496-502
[2]  
[Anonymous], MARKOV CHAIN MONTE C
[3]   GeneWise and genomewise [J].
Birney, E ;
Clamp, M ;
Durbin, R .
GENOME RESEARCH, 2004, 14 (05) :988-995
[4]   Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project [J].
Birney, Ewan ;
Stamatoyannopoulos, John A. ;
Dutta, Anindya ;
Guigo, Roderic ;
Gingeras, Thomas R. ;
Margulies, Elliott H. ;
Weng, Zhiping ;
Snyder, Michael ;
Dermitzakis, Emmanouil T. ;
Stamatoyannopoulos, John A. ;
Thurman, Robert E. ;
Kuehn, Michael S. ;
Taylor, Christopher M. ;
Neph, Shane ;
Koch, Christoph M. ;
Asthana, Saurabh ;
Malhotra, Ankit ;
Adzhubei, Ivan ;
Greenbaum, Jason A. ;
Andrews, Robert M. ;
Flicek, Paul ;
Boyle, Patrick J. ;
Cao, Hua ;
Carter, Nigel P. ;
Clelland, Gayle K. ;
Davis, Sean ;
Day, Nathan ;
Dhami, Pawandeep ;
Dillon, Shane C. ;
Dorschner, Michael O. ;
Fiegler, Heike ;
Giresi, Paul G. ;
Goldy, Jeff ;
Hawrylycz, Michael ;
Haydock, Andrew ;
Humbert, Richard ;
James, Keith D. ;
Johnson, Brett E. ;
Johnson, Ericka M. ;
Frum, Tristan T. ;
Rosenzweig, Elizabeth R. ;
Karnani, Neerja ;
Lee, Kirsten ;
Lefebvre, Gregory C. ;
Navas, Patrick A. ;
Neri, Fidencio ;
Parker, Stephen C. J. ;
Sabo, Peter J. ;
Sandstrom, Richard ;
Shafer, Anthony .
NATURE, 2007, 447 (7146) :799-816
[5]   Reconstructing large regions of an ancestral mammalian genome in silico [J].
Blanchette, M ;
Green, ED ;
Miller, W ;
Haussler, D .
GENOME RESEARCH, 2004, 14 (12) :2412-2423
[6]   Aligning multiple genomic sequences with the threaded blockset aligner [J].
Blanchette, M ;
Kent, WJ ;
Riemer, C ;
Elnitski, L ;
Smit, AFA ;
Roskin, KM ;
Baertsch, R ;
Rosenbloom, K ;
Clawson, H ;
Green, ED ;
Haussler, D ;
Miller, W .
GENOME RESEARCH, 2004, 14 (04) :708-715
[7]   Transducers: an emerging probabilistic framework for modeling indels on trees [J].
Bradley, Robert K. ;
Holmes, Ian .
BIOINFORMATICS, 2007, 23 (23) :3258-3262
[8]   MAVID: Constrained ancestral alignment of multiple sequences [J].
Bray, N ;
Pachter, L .
GENOME RESEARCH, 2004, 14 (04) :693-699
[9]   LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA [J].
Brudno, M ;
Do, CB ;
Cooper, GM ;
Kim, MF ;
Davydov, E ;
Green, ED ;
Sidow, A ;
Batzoglou, S .
GENOME RESEARCH, 2003, 13 (04) :721-731
[10]   Glocal alignment: finding rearrangements during alignment [J].
Brudno, Michael ;
Malde, Sanket ;
Poliakov, Alexander ;
Do, Chuong B. ;
Couronne, Olivier ;
Dubchak, Inna ;
Batzoglou, Serafim .
BIOINFORMATICS, 2003, 19 :i54-i62