OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization

Cited by: 56
Authors
Ahdritz, Gustaf [1, 2]
Bouatta, Nazim [3]
Floristean, Christina [1]
Kadyan, Sachin [1]
Xia, Qinghui [1]
Gerecke, William [3]
O'Donnell, Timothy J. [4]
Berenberg, Daniel [5]
Fisk, Ian [6]
Zanichelli, Niccolo [7]
Zhang, Bo [8]
Nowaczynski, Arkadiusz [9]
Wang, Bei [9]
Stepniewska-Dziubinska, Marta M. [9]
Zhang, Shang [9]
Ojewole, Adegoke [9]
Guney, Murat Efe [9]
Biderman, Stella [10, 11]
Watkins, Andrew M. [12]
Ra, Stephen [12]
Lorenzo, Pablo Ribalta [9]
Nivon, Lucas [13]
Weitzner, Brian [14]
Ban, Yih-En Andrew [15]
Chen, Shiyang [16]
Zhang, Minjia [17]
Li, Conglong [18]
Song, Shuaiwen Leon [18]
He, Yuxiong [18]
Sorger, Peter K. [3]
Mostaque, Emad [19]
Zhang, Zhao [16]
Bonneau, Richard [12]
AlQuraishi, Mohammed [1]
Affiliations
[1] Columbia Univ, Dept Syst Biol, New York, NY 10032 USA
[2] Harvard Univ, Cambridge, MA USA
[3] Harvard Med Sch, Lab Syst Pharmacol, Boston, MA 02115 USA
[4] Icahn Sch Med Mt Sinai, New York, NY USA
[5] NYU, Courant Inst Math Sci, Dept Comp Sci, New York, NY USA
[6] Flatiron Inst, New York, NY USA
[7] OpenBioML, Cambridge, MA USA
[8] Univ Utah, Sci Comp & Imaging Inst, Salt Lake City, UT USA
[9] NVIDIA, Santa Clara, CA USA
[10] EleutherAI, New York, NY USA
[11] Booz Allen Hamilton, McLean, VA USA
[12] Prescient Design, Genentech, New York, NY USA
[13] Cyrus Bio, Seattle, WA USA
[14] Outpace Bio, Seattle, WA USA
[15] Arzeda, Seattle, WA USA
[16] Rutgers State Univ, New Brunswick, NJ USA
[17] Univ Illinois Champaign Urbana, Champaign, IL USA
[18] Microsoft, Redmond, WA USA
[19] Stability AI, Los Altos, CA USA
Keywords
PROTEIN-STRUCTURE PREDICTION; DOMAIN
DOI
10.1038/s41592-024-02272-z
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein-ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set are deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community. OpenFold is a trainable open-source implementation of AlphaFold2. It is fast and memory-efficient, and the code and training data are available under a permissive license.
Pages: 1514-1524
Page count: 26