A practical guide to machine-learning scoring for structure-based virtual screening

Citations: 41
Authors
Tran-Nguyen, Viet-Khoa [1]
Junaid, Muhammad [1]
Simeon, Saw [1]
Ballester, Pedro J. [2]
Affiliations
[1] Ctr Rech Cancerol Marseille, Marseille, France
[2] Imperial Coll London, Dept Bioengn, London, England
Keywords
ASSAY INTERFERENCE COMPOUNDS; LIGAND BINDING-AFFINITY; SWISS-MODEL REPOSITORY; APPLICABILITY DOMAIN; MOLECULAR DOCKING; COMPOUNDS PAINS; DATA SETS; PROTEIN; DISCOVERY; ACCURACY;
DOI
10.1038/s41596-023-00885-w
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Structure-based virtual screening (SBVS) via docking has been used to discover active molecules for a range of therapeutic targets. Chemical and protein data sets that contain integrated bioactivity information have increased both in number and in size. Artificial intelligence and, more concretely, its machine-learning (ML) branch, including deep learning, have effectively exploited these data sets to build scoring functions (SFs) for SBVS against targets with an atomic-resolution 3D model (e.g., generated by X-ray crystallography or predicted by AlphaFold2). Often outperforming their generic and non-ML counterparts, target-specific ML-based SFs represent the state of the art for SBVS. Here, we present a comprehensive and user-friendly protocol to build and rigorously evaluate these new SFs for SBVS. This protocol is organized into four sections: (i) using a public benchmark of a given target to evaluate an existing generic SF; (ii) preparing experimental data for a target from public repositories; (iii) partitioning data into a training set and a test set for subsequent target-specific ML modeling; and (iv) generating and evaluating target-specific ML SFs by using the prepared training-test partitions. All necessary code and input/output data related to three example targets (acetylcholinesterase, HMG-CoA reductase, and peroxisome proliferator-activated receptor-alpha) are available at https://github.com/vktrannguyen/MLSF-protocol, can be run by using a single computer within 1 week and make use of easily accessible software/programs (e.g., Smina, CNN-Score, RF-Score-VS and DeepCoy) and web resources. Our aim is to provide practical guidance on how to augment training data to enhance SBVS performance, how to identify the most suitable supervised learning algorithm for a data set, and how to build an SF with the highest likelihood of discovering target-active molecules within a given compound library.
Pages: 3460-3511
Page count: 52