Neural networks to learn protein sequence-function relationships from deep mutational scanning data

Cited by: 84
Authors
Gelman, Sam [1, 2]
Fahlberg, Sarah A. [3 ]
Heinzelman, Pete [3 ]
Romero, Philip A. [3 ]
Gitter, Anthony [1, 2, 4]
Affiliations
[1] Univ Wisconsin, Dept Comp Sci, Madison, WI 53706 USA
[2] Morgridge Inst Res, Madison, WI 53715 USA
[3] Univ Wisconsin, Dept Biochem, Madison, WI 53706 USA
[4] Univ Wisconsin, Dept Biostat & Med Informat, Madison, WI 53792 USA
Keywords
protein engineering; deep learning; convolutional neural network; fitness landscape; epistasis
DOI
10.1073/pnas.2104878118
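Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification code
07; 0710; 09
Abstract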
The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein's behavior and properties. We present a supervised deep learning framework to learn the sequence-function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network's internal representation affects its ability to learn the sequence-function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find that networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks' ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models' ability to navigate sequence space and design new proteins beyond the training set. We applied the protein G B1 domain (GB1) models to design a sequence that binds to immunoglobulin G with substantially higher affinity than wild-type GB1.
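As an illustration of what "sharing parameters across sequence positions" with a nonlinearity can look like, here is a minimal NumPy sketch of a one-hot encoded sequence passed through a 1D convolution, a ReLU, and a linear output. It is not the authors' implementation: the function names (`one_hot`, `conv1d_relu`, `predict_score`), the filter shapes, and the plain one-hot encoding are all assumptions chosen for illustration, and training on deep mutational scanning scores is omitted.

```python
import numpy as np

# The 20 standard amino acids; the ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Encode a protein sequence as a (length, 20) one-hot matrix."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x


def conv1d_relu(x, filters, bias):
    """Apply a 'valid' 1D convolution followed by ReLU.

    x:       (length, 20) encoded sequence
    filters: (kernel_size, 20, n_filters), the SAME weights at every window
    bias:    (n_filters,)
    Returns a (length - kernel_size + 1, n_filters) feature map.
    """
    k = filters.shape[0]
    n_windows = x.shape[0] - k + 1
    out = np.empty((n_windows, filters.shape[2]))
    for i in range(n_windows):
        window = x[i:i + k]  # (kernel_size, 20)
        # Contract the window against every filter; weights do not depend on i.
        out[i] = np.tensordot(window, filters, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU supplies the nonlinearity


def predict_score(seq, filters, bias, w_out, b_out):
    """Map a sequence to one scalar functional score (hypothetical model head)."""
    h = conv1d_relu(one_hot(seq), filters, bias)
    return float(h.reshape(-1) @ w_out + b_out)
```

Because the filter weights are reused at every window, identical sequence windows yield identical features wherever they occur, and the parameter count of the convolutional layer does not grow with sequence length. In a supervised setting, `filters`, `bias`, `w_out`, and `b_out` would be fit to measured variant scores.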
Pages: 12