K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets

Cited by: 12
Authors
Pessia, Alberto [1]
Grad, Yonatan [2, 3]
Cobey, Sarah [4]
Puranen, Juha Santeri [5]
Corander, Jukka [1]
Affiliations
[1] Univ Helsinki, Dept Math & Stat, Helsinki, Finland
[2] Harvard TH Chan Sch Publ Hlth, Dept Immunol & Infect Dis, Boston, MA USA
[3] Harvard Med Sch, Brigham & Womens Hosp, Div Infect Dis, Dept Med, Boston, MA USA
[4] Univ Chicago, Dept Ecol & Evolut, Chicago, IL 60637 USA
[5] Abo Akad Univ, Dept Biosci, Turku, Finland
Keywords
data clustering; protein evolution; sequence analysis
DOI
10.1099/mgen.0.000025
Chinese Library Classification (CLC) number
Q3 [Genetics]
Subject classification codes
071007; 090102
Abstract
The recent growth in publicly available sequence data has introduced new opportunities for studying microbial evolution and spread. Because the pace of sequence accumulation tends to exceed the pace of experimental studies of protein function and the roles of individual amino acids, statistical tools to identify meaningful patterns in protein diversity are essential. Large sequence alignments from fast-evolving micro-organisms are particularly challenging to dissect using standard tools from phylogenetics and multivariate statistics because biologically relevant functional signals are easily masked by neutral variation and noise. To meet this need, a novel computational method is introduced that is easily executed in parallel using a cluster environment and can handle thousands of sequences with minimal subjective input from the user. The usefulness of this kind of machine learning is demonstrated by applying it to nearly 5000 haemagglutinin sequences of influenza A/H3N2. Antigenic and 3D structural mapping of the results show that the method can recover the major jumps in antigenic phenotype that occurred between 1968 and 2013 and identify specific amino acids associated with these changes. The method is expected to provide a useful tool to uncover patterns of protein evolution.
Pages: 11