Improving Contact Prediction along Three Dimensions

Cited by: 59
Authors
Feinauer, Christoph [1, 2]
Skwark, Marcin J. [3, 4]
Pagnani, Andrea [1, 2, 5]
Aurell, Erik [3, 4, 6]
Affiliations
[1] Politecn Torino, DISAT, Turin, Italy
[2] Politecn Torino, Ctr Computat Sci, Turin, Italy
[3] Aalto Univ, Dept Informat & Comp Sci, Aalto, Finland
[4] Aalto Univ, Aalto Sci Inst AScI, Aalto, Finland
[5] Human Genet Fdn Torino, Ctr Mol Biotechnol, Turin, Italy
[6] AlbaNova Univ Ctr, Royal Inst Technol, Dept Computat Biol, Stockholm, Sweden
Funding
Academy of Finland
Keywords
DIRECT-COUPLING ANALYSIS; CORRELATED MUTATIONS; PROTEIN-STRUCTURE; SEQUENCE; CLASSIFICATION; COVARIANCE;
DOI
10.1371/journal.pcbi.1003847
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Correlation patterns in multiple sequence alignments of homologous proteins can be exploited to infer information on the three-dimensional structure of their members. The typical pipeline to address this task, which we in this paper refer to as the three dimensions of contact prediction, is to (i) filter and align the raw sequence data representing the evolutionarily related proteins; (ii) choose a predictive model to describe a sequence alignment; (iii) infer the model parameters and interpret them in terms of structural properties, such as an accurate contact map. We show here that all three dimensions are important for overall prediction success. In particular, we show that it is possible to improve significantly along the second dimension by going beyond the pair-wise Potts models from statistical physics, which have hitherto been the focus of the field. These (simple) extensions are motivated by multiple sequence alignments often containing long stretches of gaps which, as a data feature, would be rather untypical for independent samples drawn from a Potts model. Using a large test set of proteins we show that the combined improvements along the three dimensions are as large as any reported to date.
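The third step of the pipeline, interpreting inferred parameters as a contact map, can be illustrated with a minimal sketch. The abstract does not say how couplings are turned into scores, so the sketch assumes a convention common in direct-coupling analysis: the Frobenius norm of each pairwise coupling block followed by an average product correction (APC). The function names `contact_scores` and `top_contacts`, the `min_sep` cutoff, and the synthetic coupling tensor are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np


def contact_scores(J):
    """Score residue pairs from a coupling tensor J of shape (L, L, q, q).

    Assumed convention (not taken from the abstract): Frobenius norm of
    each q x q coupling block, then average product correction (APC).
    """
    L = J.shape[0]
    # Frobenius norm of every coupling block J[i, j, :, :]
    F = np.sqrt((J ** 2).sum(axis=(2, 3)))
    # Symmetrize and ignore self-pairs
    F = 0.5 * (F + F.T)
    np.fill_diagonal(F, 0.0)
    # APC: subtract the score expected from the per-position averages alone
    row_mean = F.sum(axis=1) / (L - 1)
    total_mean = F.sum() / (L * (L - 1))
    S = F - np.outer(row_mean, row_mean) / total_mean
    np.fill_diagonal(S, 0.0)
    return S


def top_contacts(S, k, min_sep=5):
    """Return the k highest-scoring pairs (i, j) with j - i >= min_sep."""
    i, j = np.triu_indices(S.shape[0], k=min_sep)
    order = np.argsort(S[i, j])[::-1][:k]
    return [(int(i[o]), int(j[o])) for o in order]


# Synthetic example: weak random couplings plus one strong pair (0, 7).
rng = np.random.default_rng(0)
L, q = 10, 3
J = rng.normal(scale=0.01, size=(L, L, q, q))
J[0, 7] = 1.0
J[7, 0] = 1.0

S = contact_scores(J)
best = top_contacts(S, k=1)
print(best)
```

In practice the tensor `J` would come from the inference step (the second and third "dimensions" in the abstract) rather than from random numbers, and the top-ranked pairs would be compared against contacts in a known structure.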
Pages: 13