Three-Dimensional Convolutional Neural Networks and a Cross-Docked Data Set for Structure-Based Drug Design

Cited by: 175
Authors
Francoeur, Paul G. [1]
Masuda, Tomohide [1]
Sunseri, Jocelyn [1]
Jia, Andrew [1]
Iovanisci, Richard B. [1]
Snyder, Ian [1]
Koes, David R. [1]
Affiliations
[1] Univ Pittsburgh, Dept Computat & Syst Biol, Pittsburgh, PA 15260 USA
Funding
National Science Foundation (USA);
Keywords
SCORING FUNCTIONS; BINDING-AFFINITY; FORCE-FIELD; DOCKING; VALIDATION; APPROPRIATE; DISCOVERY;
DOI
10.1021/acs.jcim.0c00411
Chinese Library Classification
R914 [Medicinal Chemistry];
Discipline classification code
100701;
Abstract
One of the main challenges in drug discovery is predicting protein-ligand binding affinity. Recently, machine learning approaches have made substantial progress on this task. However, current methods of model evaluation are overly optimistic in measuring generalization to new targets, and there does not exist a standard data set of sufficient size to compare performance between models. We present a new data set for structure-based machine learning, the CrossDocked2020 set, with 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank, and perform a comprehensive evaluation of grid-based convolutional neural network (CNN) models on this data set. We also demonstrate how the partitioning of the training data and test data can impact the results of models trained with the PDBbind data set, how performance improves by adding more lower-quality training data, and how training with docked poses imparts pose sensitivity to the predicted affinity of a complex. Our best performing model, an ensemble of five densely connected CNNs, achieves a root mean squared error of 1.42 and Pearson R of 0.612 on the affinity prediction task, an AUC of 0.956 at binding pose classification, and a 68.4% accuracy at pose selection on the CrossDocked2020 set. By providing data splits for clustered cross-validation and the raw data for the CrossDocked2020 set, we establish the first standardized data set for training machine learning models to recognize ligands in noncognate target structures while also greatly expanding the number of poses available for training. In order to facilitate community adoption of this data set for benchmarking protein-ligand binding affinity prediction, we provide our models, weights, and the CrossDocked2020 set at https://github.com/gnina/models.
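The evaluation quantities named in the abstract (root mean squared error and Pearson R for affinity, AUC for pose classification, top-1 accuracy for pose selection) and the idea of a clustered split can be illustrated with a minimal sketch. This is not the authors' code: all function names, the toy numbers, the PDB-like identifiers, and the round-robin fold assignment are assumptions chosen for illustration, and "pose selection accuracy" is taken here to mean the fraction of complexes whose top-scored pose is a good pose.

```python
import math

def rmse(pred, true):
    """Root mean squared error between predicted and measured affinities."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative (ties count as half). Quadratic time; fine for a toy example."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def top1_accuracy(complexes):
    """Fraction of complexes whose highest-scored pose is flagged as good.
    Each complex is a list of (score, is_good_pose) tuples (assumed layout)."""
    hits = sum(1 for poses in complexes if max(poses, key=lambda p: p[0])[1])
    return hits / len(complexes)

def clustered_folds(cluster_of, n_folds=3):
    """Assign whole clusters to folds so no cluster spans two folds.
    Round-robin assignment is an assumption, not the paper's procedure."""
    clusters = sorted(set(cluster_of.values()))
    fold_of_cluster = {c: i % n_folds for i, c in enumerate(clusters)}
    folds = [[] for _ in range(n_folds)]
    for item, cluster in sorted(cluster_of.items()):
        folds[fold_of_cluster[cluster]].append(item)
    return folds

# Toy data (made up for illustration only).
pred_affinity = [6.0, 7.0, 5.0, 8.0]
true_affinity = [6.5, 6.5, 5.5, 7.5]
pose_scores = [0.9, 0.8, 0.4, 0.3]
pose_labels = [1, 0, 1, 0]
complexes = [[(0.9, True), (0.2, False)], [(0.3, True), (0.7, False)]]
cluster_of = {"1abc": "kinase", "2def": "kinase",
              "3ghi": "protease", "4jkl": "gpcr"}

print("RMSE:", rmse(pred_affinity, true_affinity))
print("Pearson R:", round(pearson_r(pred_affinity, true_affinity), 4))
print("AUC:", roc_auc(pose_scores, pose_labels))
print("Top-1 accuracy:", top1_accuracy(complexes))
print("Folds:", clustered_folds(cluster_of))
```

Keeping every member of a cluster in the same fold is what makes the test targets dissimilar from the training targets; a random per-complex split would let near-identical pockets appear on both sides and inflate the scores.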
Pages: 4200-4215
Page count: 16