Protein language models learn evolutionary statistics of interacting sequence motifs

Cited by: 0
Authors
Zhang, Zhidian [1 ,2 ,3 ]
Wayment-Steele, Hannah K. [4 ,5 ]
Brixi, Garyk [6 ]
Wang, Haobo [1 ,5 ]
Kern, Dorothee [4 ]
Ovchinnikov, Sergey [2 ,7 ]
Affiliations
[1] Harvard Univ, Cambridge, MA 02138 USA
[2] MIT, Dept Biol, Cambridge, MA 02139 USA
[3] Ecole Polytech Fed Lausanne, Inst Bioengn, Sch Life Sci, CH-1015 Lausanne, Switzerland
[4] Brandeis Univ, HHMI, Waltham, MA 02453 USA
[5] Brandeis Univ, Dept Biochem, Waltham, MA 02453 USA
[6] Harvard Univ, Harvard Coll, Cambridge, MA 02138 USA
[7] Harvard Univ, John Harvard Distinguished Sci Fellowship, Cambridge, MA 02138 USA
Keywords
language models; interpretability study; protein structure prediction; recognition; design
DOI
10.1073/pnas.2406285121/-/DCSupplemental
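Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09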
Abstract
Protein language models (pLMs) have emerged as potent tools for predicting and designing protein structure and function, yet the degree to which these models understand the underlying biophysics of protein structure remains an open question. Motivated by the finding that pLM-based structure predictors erroneously predict nonphysical structures for protein isoforms, we investigated the nature of the sequence context needed for contact prediction in the pLM ESM-2 (Evolutionary Scale Modeling). Using a "categorical Jacobian" calculation, we demonstrate that ESM-2 stores statistics of coevolving residues, analogous to simpler modeling approaches such as Markov random fields and multivariate Gaussian models. We further investigated how ESM-2 stores the information needed to predict contacts by comparing sequence masking strategies, and found that providing local windows of sequence information allowed ESM-2 to best recover predicted contacts. This suggests that pLMs predict contacts by storing motifs of pairwise contacts. Our investigation highlights the limitations of current pLMs and underscores the importance of understanding the underlying mechanisms of these models.
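The "categorical Jacobian" named in the abstract can be illustrated with a minimal sketch. This record does not give the authors' implementation, so everything below is an assumption made for illustration: `toy_logits` is a hypothetical stand-in for a language model (it is not ESM-2), and the post-processing in `coupling_map` (symmetrization, Frobenius norm, average product correction) is a common convention in coevolution analysis rather than a confirmed detail of the paper. The idea shown is only the core one: substitute every token at every position and record how the output logits change at all other positions.

```python
import numpy as np

def toy_logits(seq, n_tokens=4, pairs=((0, 3),)):
    """Hypothetical stand-in for a language model (NOT ESM-2).
    The logits at each coupled position depend only on the token at its partner."""
    rng = np.random.default_rng(0)           # fixed seed, so W is identical on every call
    W = rng.normal(size=(n_tokens, n_tokens))
    logits = np.zeros((len(seq), n_tokens))
    for i, j in pairs:
        logits[j] = W[seq[i]]
        logits[i] = W[seq[j]]
    return logits

def categorical_jacobian(logits_fn, seq, n_tokens):
    """jac[i, a, j, b] = change in logit b at position j when position i is set to token a."""
    seq = np.asarray(seq)
    L = seq.shape[0]
    base = logits_fn(seq)
    jac = np.zeros((L, n_tokens, L, n_tokens))
    for i in range(L):
        for a in range(n_tokens):
            mutated = seq.copy()
            mutated[i] = a
            jac[i, a] = logits_fn(mutated) - base
    return jac

def coupling_map(jac):
    """Reduce the 4D Jacobian to an L x L score map (assumed post-processing)."""
    sym = 0.5 * (jac + jac.transpose(2, 3, 0, 1))   # symmetrize (i, a) <-> (j, b)
    raw = np.sqrt((sym ** 2).sum(axis=(1, 3)))      # Frobenius norm over token axes
    np.fill_diagonal(raw, 0.0)
    total = raw.sum()
    if total > 0:                                   # average product correction
        raw = raw - raw.sum(0, keepdims=True) * raw.sum(1, keepdims=True) / total
    np.fill_diagonal(raw, 0.0)
    return raw

seq = np.array([0, 1, 2, 3, 1])
jac = categorical_jacobian(toy_logits, seq, n_tokens=4)
cmap = coupling_map(jac)
top = np.unravel_index(cmap.argmax(), cmap.shape)
print("strongest coupling between positions:", tuple(int(k) for k in top))
```

With a real pLM, `logits_fn` would wrap a forward pass of the model, and the cost is one forward pass per (position, token) substitution, which is why batching the substitutions matters in practice.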
Pages: 9