Large language models generate functional protein sequences across diverse families

Citations: 463
Authors
Madani, Ali [1, 2]
Krause, Ben [1]
Greene, Eric R. [3]
Subramanian, Subu [4, 5]
Mohr, Benjamin P. [6]
Holton, James M. [7, 8, 9]
Olmos, Jose Luis [3]
Xiong, Caiming [1]
Sun, Zachary Z. [6]
Socher, Richard [1]
Fraser, James S. [3]
Naik, Nikhil [1]
Affiliations
[1] Salesforce Res, Palo Alto, CA 94301 USA
[2] Profluent Bio, San Francisco, CA 94118 USA
[3] Univ Calif San Francisco, Dept Bioengn & Therapeut Sci, San Francisco, CA USA
[4] Univ Calif Berkeley, Dept Mol & Cell Biol, Berkeley, CA USA
[5] Univ Calif Berkeley, Howard Hughes Med Inst, Berkeley, CA USA
[6] Tierra Biosci, San Leandro, CA USA
[7] Lawrence Berkeley Natl Lab, Mol Biophys & Integrated Bioimaging Div, Berkeley, CA USA
[8] SLAC Natl Accelerator Lab, Stanford Synchrotron Radiat Lightsource, Menlo Pk, CA USA
[9] Univ Calif San Francisco, Dept Biochem & Biophys, San Francisco, CA USA
Funding
National Institutes of Health (NIH), USA
Keywords
STRUCTURE REFINEMENT; T4 LYSOZYME; CONTACTS
DOI
10.1038/s41587-022-01618-2
Chinese Library Classification (CLC)
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Discipline classification codes
071005; 0836; 090102; 100705
Abstract
A generative deep-learning model designs artificial proteins with desired enzymatic activities. Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural-language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned on curated sequences and tags to improve controllable generation of proteins from families with sufficient homologous samples. Artificial proteins generated after fine-tuning on five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
Pages: 1099-+
Page count: 17
Related papers
86 entries in total
[1]   Towards automated crystallographic structure refinement with phenix.refine [J].
Afonine, Pavel V. ;
Grosse-Kunstleve, Ralf W. ;
Echols, Nathaniel ;
Headd, Jeffrey J. ;
Moriarty, Nigel W. ;
Mustyakimov, Marat ;
Terwilliger, Thomas C. ;
Urzhumtsev, Alexandre ;
Zwart, Peter H. ;
Adams, Paul D. .
ACTA CRYSTALLOGRAPHICA SECTION D-STRUCTURAL BIOLOGY, 2012, 68 :352-367
[2]   Unified rational protein engineering with sequence-based deep representation learning [J].
Alley, Ethan C. ;
Khimulya, Grigory ;
Biswas, Surojit ;
AlQuraishi, Mohammed ;
Church, George M. .
NATURE METHODS, 2019, 16 (12) :1315-+
[3]  
AlQuraishi M., 2019, SOME THOUGHTS MYSTER
[4]   Protein sequence design with a learned potential [J].
Anand, Namrata ;
Eguchi, Raphael ;
Mathews, Irimpan I. ;
Perez, Carla P. ;
Derry, Alexander ;
Altman, Russ B. ;
Huang, Po-Ssu .
NATURE COMMUNICATIONS, 2022, 13 (01)
[5]   De novo protein design by deep network hallucination [J].
Anishchenko, Ivan ;
Pellock, Samuel J. ;
Chidyausiku, Tamuka M. ;
Ramelot, Theresa A. ;
Ovchinnikov, Sergey ;
Hao, Jingzhou ;
Bafna, Khushboo ;
Norn, Christoffer ;
Kang, Alex ;
Bera, Asim K. ;
DiMaio, Frank ;
Carter, Lauren ;
Chow, Cameron M. ;
Montelione, Gaetano T. ;
Baker, David .
NATURE, 2021, 600 (7889) :547-+
[6]  
[Anonymous], 2013, Pmlr, DOI 10.48550/ARXIV.1211.5063
[7]   Gene Ontology: tool for the unification of biology [J].
Ashburner, M ;
Ball, CA ;
Blake, JA ;
Botstein, D ;
Butler, H ;
Cherry, JM ;
Davis, AP ;
Dolinski, K ;
Dwight, SS ;
Eppig, JT ;
Harris, MA ;
Hill, DP ;
Issel-Tarver, L ;
Kasarskis, A ;
Lewis, S ;
Matese, JC ;
Richardson, JE ;
Ringwald, M ;
Rubin, GM ;
Sherlock, G .
NATURE GENETICS, 2000, 25 (01) :25-29
[8]   Lessons from the lysozyme of phage T4 [J].
Baase, Walter A. ;
Liu, Lijun ;
Tronrud, Dale E. ;
Matthews, Brian W. .
PROTEIN SCIENCE, 2010, 19 (04) :631-641
[9]   The universal protein resource (UniProt) [J].
Bairoch, A ;
Apweiler, R ;
Wu, CH ;
Barker, WC ;
Boeckmann, B ;
Ferro, S ;
Gasteiger, E ;
Huang, HZ ;
Lopez, R ;
Magrane, M ;
Martin, MJ ;
Natale, DA ;
O'Donovan, C ;
Redaschi, N ;
Yeh, LSL .
NUCLEIC ACIDS RESEARCH, 2005, 33 :D154-D159
[10]   Learning generative models for protein fold families [J].
Balakrishnan, Sivaraman ;
Kamisetty, Hetunandan ;
Carbonell, Jaime G. ;
Lee, Su-In ;
Langmead, Christopher James .
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2011, 79 (04) :1061-1078