Protein language models learn evolutionary statistics of interacting sequence motifs

Times cited: 0
Authors
Zhang, Zhidian [1, 2, 3]
Wayment-Steele, Hannah K. [4, 5]
Brixi, Garyk [6]
Wang, Haobo [1, 5]
Kern, Dorothee [4]
Ovchinnikov, Sergey [2, 7]
Affiliations
[1] Harvard Univ, Cambridge, MA 02138 USA
[2] MIT, Dept Biol, Cambridge, MA 02139 USA
[3] Ecole Polytech Fed Lausanne, Inst Bioengn, Sch Life Sci, CH-1015 Lausanne, Switzerland
[4] Brandeis Univ, HHMI, Waltham, MA 02453 USA
[5] Brandeis Univ, Dept Biochem, Waltham, MA 02453 USA
[6] Harvard Univ, Harvard Coll, Cambridge, MA 02138 USA
[7] Harvard Univ, John Harvard Distinguished Sci Fellowship, Cambridge, MA 02138 USA
Keywords
language models; interpretability study; protein structure prediction; RECOGNITION; DESIGN;
DOI
10.1073/pnas.2406285121/-/DCSupplemental
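Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract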
Protein language models (pLMs) have emerged as potent tools for predicting and designing protein structure and function, but the degree to which these models understand the underlying biophysics of protein structure remains an open question. Motivated by the finding that pLM-based structure predictors erroneously predict nonphysical structures for protein isoforms, we investigated the nature of the sequence context needed for contact prediction in the pLM Evolutionary Scale Modeling (ESM-2). Using a "categorical Jacobian" calculation, we demonstrate that ESM-2 stores statistics of coevolving residues, analogous to simpler modeling approaches such as Markov random fields and multivariate Gaussian models. We further investigated how ESM-2 "stores" the information needed to predict contacts by comparing sequence masking strategies, and found that providing local windows of sequence information allowed ESM-2 to best recover predicted contacts. This suggests that pLMs predict contacts by storing motifs of pairwise contacts. Our investigation highlights the limitations of current pLMs and underscores the importance of understanding the underlying mechanisms of these models.
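The abstract names the "categorical Jacobian" without defining it. The sketch below illustrates one plausible reading, assumed here rather than taken from the paper: substitute every token at every position, record how the model's output logits change at all other positions, and reduce the resulting 4D tensor to a position-by-position coupling map. The toy `logits_fn`, the planted coupling between positions 2 and 7, and the post-processing (centering, symmetrizing, Frobenius norm, average product correction) are all illustrative assumptions. This is not ESM-2 and not the authors' implementation.

```python
import numpy as np

def categorical_jacobian(logits_fn, seq, n_tokens):
    """jac[i, a, j, b]: change in logit b at position j when position i is set to token a."""
    base = logits_fn(seq)                       # shape (L, A)
    L = len(seq)
    jac = np.zeros((L, n_tokens, L, n_tokens))
    for i in range(L):
        for a in range(n_tokens):
            mutated = seq.copy()
            mutated[i] = a
            jac[i, a] = logits_fn(mutated) - base
    return jac

def coupling_scores(jac):
    """Reduce the 4D Jacobian to an (L, L) map (assumed post-processing)."""
    jac = jac - jac.mean(axis=1, keepdims=True)  # center over input tokens
    jac = jac - jac.mean(axis=3, keepdims=True)  # center over output tokens
    sym = 0.5 * (jac + jac.transpose(2, 3, 0, 1))
    raw = np.sqrt((sym ** 2).sum(axis=(1, 3)))   # Frobenius norm per position pair
    np.fill_diagonal(raw, 0.0)
    # Average product correction (APC) removes per-position background signal.
    apc = raw.sum(axis=0, keepdims=True) * raw.sum(axis=1, keepdims=True) / raw.sum()
    scores = raw - apc
    np.fill_diagonal(scores, 0.0)
    return scores

# Hypothetical stand-in for a pLM: a Potts-like model with a single planted
# coupling between positions 2 and 7 (an assumption for illustration only).
rng = np.random.default_rng(0)
L, A = 10, 4
W = np.zeros((L, A, L, A))
W[2, :, 7, :] = rng.normal(size=(A, A))
W[7, :, 2, :] = W[2, :, 7, :].T
h = rng.normal(size=(L, A))

def logits_fn(seq):
    # Logits at position j: site bias plus contributions from every position i.
    return h + W[np.arange(L), seq].sum(axis=0)

seq = rng.integers(0, A, size=L)
scores = coupling_scores(categorical_jacobian(logits_fn, seq, A))
idx = np.unravel_index(np.argmax(np.triu(scores, k=1)), scores.shape)
top_pair = (int(idx[0]), int(idx[1]))
print("highest-scoring pair:", top_pair)
```

In this toy setting the highest-scoring pair should be the planted one. With a real pLM, `logits_fn` would wrap a forward pass of the model, at a cost of one forward pass per position and token.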
Pages: 9