De Novo Protein Design for Novel Folds Using Guided Conditional Wasserstein Generative Adversarial Networks

Cited by: 52
Authors
Karimi, Mostafa [1, 2]
Zhu, Shaowen [1]
Cao, Yue [1]
Shen, Yang [1, 2]
Affiliations
[1] Texas A&M Univ, Dept Elect & Comp Engn, College Stn, TX 77843 USA
[2] Texas A&M Univ, TEES AgriLife Ctr Bioinformat & Genom Syst Engn, College Stn, TX 77840 USA
Funding
National Institutes of Health (USA);
Keywords
STRUCTURE PREDICTION; RESIDUE CONTACTS; PRINCIPLES; POTENTIALS; SIMILARITY; ALGORITHM; FRAMEWORK; SEQUENCE;
DOI
10.1021/acs.jcim.0c00593
Chinese Library Classification number
R914 [Medicinal Chemistry];
Discipline classification code
100701;
Abstract
Although data on protein sequence and structure are accumulating rapidly, the number of protein architectural types (or structural folds) remains small. This study addresses the following question: how well can one reveal underlying sequence-structure relationships and design protein sequences for an arbitrary, potentially novel, structural fold? In response, we have developed novel deep generative models, namely semisupervised gcWGAN (guided, conditional, Wasserstein Generative Adversarial Networks). To overcome training difficulties and improve design quality, we build our models on a conditional Wasserstein GAN (WGAN), which uses the Wasserstein distance in the loss function. Our major contributions include (1) constructing a low-dimensional and generalizable representation of the fold space for the conditional input, (2) developing an ultrafast sequence-to-fold predictor (or oracle) and incorporating its feedback into the WGAN as a loss to guide model training, and (3) exploiting sequence data with and without paired structures to enable a semisupervised training strategy. Assessed by the oracle over 100 novel folds not in the training set, gcWGAN generates more successful designs and covers 3.5 times more target folds than a competing data-driven method (cVAE). Assessed by sequence- and structure-based predictors, gcWGAN designs are physically and biologically sound. Assessed by a structure predictor over representative novel folds, including one that is not even among the basis folds, gcWGAN designs have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE designs. The ultrafast data-driven model is further shown to boost the success of a principle-driven de novo method (RosettaDesign) by generating design seeds and tailoring the design space. In conclusion, gcWGAN explores uncharted sequence space to design proteins by learning generalizable principles from current sequence-structure data.
Data, source code, and trained models are available at https://github.com/Shen-Lab/gcWGAN
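As a rough illustration of how oracle feedback could enter a Wasserstein GAN objective, the sketch below combines a standard WGAN generator term with a guidance penalty. It is a minimal sketch, not the authors' implementation: the function names, the negative log-probability form of the oracle term, and the `guidance_weight` parameter are all assumptions made for illustration, and the Lipschitz constraint on the critic (for example, a gradient penalty) is omitted.

```python
import math


def critic_loss(real_scores, fake_scores):
    """WGAN critic loss: the negated estimate of the Wasserstein distance.

    real_scores / fake_scores are critic outputs for real and generated
    sequences under the same fold condition (hypothetical inputs).
    """
    mean_real = sum(real_scores) / len(real_scores)
    mean_fake = sum(fake_scores) / len(fake_scores)
    return mean_fake - mean_real


def oracle_penalty(target_fold_probs):
    """Assumed guidance term: mean negative log-probability that the
    oracle assigns to the target fold for each generated sequence."""
    eps = 1e-12  # avoids log(0)
    total = sum(math.log(p + eps) for p in target_fold_probs)
    return -total / len(target_fold_probs)


def guided_generator_loss(fake_scores, target_fold_probs, guidance_weight=1.0):
    """Adversarial generator term plus the weighted oracle penalty.

    guidance_weight is a hypothetical hyperparameter, not a value
    taken from the paper.
    """
    adversarial = -sum(fake_scores) / len(fake_scores)
    return adversarial + guidance_weight * oracle_penalty(target_fold_probs)
```

Under these assumptions, the generator is rewarded both for sequences the critic scores highly and for sequences the oracle assigns to the desired fold; setting `guidance_weight` to zero recovers a plain conditional WGAN generator loss.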
Pages: 5667-5681
Number of pages: 15