Unsupervised genome-wide recognition of local relationship patterns

Cited by: 44
Authors
Zamani, Neda [1]
Russell, Pamela [2]
Lantz, Henrik [1]
Hoeppner, Marc P. [1]
Meadows, Jennifer R. S. [1]
Vijay, Nagarjun [3]
Mauceli, Evan [4]
di Palma, Federica [2]
Lindblad-Toh, Kerstin [1, 2]
Jern, Patric [1]
Grabherr, Manfred G. [1, 2]
Affiliations
[1] Uppsala Univ, Dept Med Biochem & Microbiol, Sci Life Lab, Uppsala, Sweden
[2] Broad Inst MIT & Harvard, Cambridge, MA USA
[3] Uppsala Univ, Dept Ecol & Genet, Evolutionary Biol Ctr, Uppsala, Sweden
[4] Boston Childrens Hosp, Boston, MA USA
Funding
Swedish Research Council;
Keywords
SELF-ORGANIZING MAP; PHYLOGENETIC ANALYSIS; MAXIMUM-LIKELIHOOD; SEQUENCE; EVOLUTION; INFERENCE; VIRUS;
DOI
10.1186/1471-2164-14-347
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Discipline classification code
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: Phenomena such as incomplete lineage sorting, horizontal gene transfer, gene duplication and subsequent sub- and neo-functionalisation can result in distinct local phylogenetic relationships that are discordant with the species phylogeny. To assess the possible biological roles of these subdivisions, they must first be identified and characterised, preferably on a large scale and in an automated fashion.

Results: We developed Saguaro, a combination of a Hidden Markov Model (HMM) and a Self Organising Map (SOM), to characterise local phylogenetic relationships among aligned sequences using cacti, which are matrices of pair-wise distance measures. While the HMM determines the genomic boundaries from aligned sequences, the SOM hypothesises new cacti in an unsupervised and iterative fashion based on the regions that were modelled least well by existing cacti. After testing the software on simulated data, we demonstrate the utility of Saguaro by applying it to two different data sets: (i) 181 Dengue virus strains, and (ii) 5 primate genomes. Saguaro identifies regions under lineage-specific constraint in the first data set, and genomic segments that we attribute to incomplete lineage sorting in the second. Intriguingly, for the primate data, Saguaro also classified an additional approximately 3% of the genome as most incompatible with the expected species phylogeny. A substantial fraction of these regions was found to overlap genes associated with both the innate and adaptive immune systems.

Conclusions: Saguaro detects distinct cacti describing local phylogenetic relationships without requiring any a priori hypotheses. We have successfully demonstrated Saguaro's utility with two contrasting data sets, one containing many members with short sequences (Dengue viral strains: n = 181, genome size = 10,700 nt), and the other with few members but complex genomes (related primate species: n = 5, genome size = 3 Gb), suggesting that the software is applicable to a wide variety of experimental populations. Saguaro is written in C++, runs on the Linux operating system, and can be downloaded from http://saguarogw.sourceforge.net/.
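
The idea of assigning genomic regions to competing cacti can be illustrated with a small sketch. This is not Saguaro's code or its actual model. It assumes fixed alignment windows, a squared-difference cost between distance matrices, a constant penalty for switching cacti, and two hand-written cacti in place of the SOM step that proposes them. All function names, parameters and data below are hypothetical.

```python
from itertools import combinations


def pairwise_distance_matrix(seqs):
    """Fraction of mismatching columns for each pair of aligned sequences."""
    n = len(seqs)
    length = len(seqs[0])
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        mismatches = sum(a != b for a, b in zip(seqs[i], seqs[j]))
        dist[i][j] = dist[j][i] = mismatches / length
    return dist


def matrix_cost(observed, cactus):
    """Sum of squared differences over the upper triangle of two matrices."""
    n = len(observed)
    return sum((observed[i][j] - cactus[i][j]) ** 2
               for i, j in combinations(range(n), 2))


def segment(windows, cacti, switch_penalty=0.5):
    """Viterbi-style dynamic programme: choose one cactus per window so that
    the total cost plus a penalty per change of cactus is minimal."""
    costs = [[matrix_cost(pairwise_distance_matrix(w), c) for c in cacti]
             for w in windows]
    k = len(cacti)
    best = list(costs[0])
    back = []
    for t in range(1, len(windows)):
        new_best, pointers = [], []
        for s in range(k):
            # Cheapest predecessor state, counting the switch penalty.
            prev = min(range(k),
                       key=lambda p: best[p] + (switch_penalty if p != s else 0.0))
            step = switch_penalty if prev != s else 0.0
            new_best.append(best[prev] + step + costs[t][s])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final state.
    state = min(range(k), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical toy data: three aligned sequences per window.
windows = [
    ["AAAA", "AAAA", "TTTT"],  # sequences 0 and 1 identical
    ["AAAA", "AAAA", "TTTT"],
    ["AAAA", "TTTT", "AAAA"],  # sequences 0 and 2 identical
    ["AAAA", "TTTT", "AAAA"],
]
cacti = [
    [[0, 0, 1], [0, 0, 1], [1, 1, 0]],  # cactus 0: sequences 0 and 1 are close
    [[0, 1, 0], [1, 0, 1], [0, 1, 0]],  # cactus 1: sequences 0 and 2 are close
]
print(segment(windows, cacti))  # [0, 0, 1, 1]
```

In this toy case the boundary between the two local relationship patterns falls between the second and third windows. A higher `switch_penalty` makes the segmentation more reluctant to change cactus.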
Pages: 11