Efficient iterative virtual screening with Apache Spark and conformal prediction

Cited by: 31
Authors
Ahmed, Laeeq [1]
Georgiev, Valentin [2]
Capuccini, Marco [2, 3]
Toor, Salman [3]
Schaal, Wesley [2]
Laure, Erwin [1]
Spjuth, Ola [2]
Affiliations
[1] Royal Inst Technol KTH, Dept Computat Sci & Technol, Lindstedtsvagen 5, S-10044 Stockholm, Sweden
[2] Uppsala Univ, Dept Pharmaceut Biosci, Box 591, S-75124 Uppsala, Sweden
[3] Uppsala Univ, Dept Informat Technol, Box 337, S-75105 Uppsala, Sweden
Keywords
Virtual screening; Docking; Conformal prediction; Cloud computing; Apache Spark; DRUG DISCOVERY; LARGE-SCALE; BENCHMARKING; DOCKING; QSAR
DOI
10.1186/s13321-018-0265-z
Chinese Library Classification
O6 [Chemistry]
Discipline classification code
0703
Abstract
Background: Docking and scoring large libraries of ligands against target proteins form the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or large workstations in a brute-force manner, by docking and scoring all available ligands.
Contribution: In this study we propose a strategy based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and predicting the remainder of the ligands to exclude those predicted as 'low-scoring'. Then another set of ligands is docked, the model is retrained, and the process is repeated until a certain model efficiency level is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use SVM and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling.
Results: We show on 4 different targets that conformal prediction based virtual screening (CPVS) reduces the number of docked molecules by 62.61% while retaining an accuracy of 94% on average for the top 30 hits, with a speedup of 3.7. The implementation is available as open source via GitHub (https://github.com/laeeq80/spark-cpvs) and can be run on high-performance computers as well as on cloud resources.
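The iterative loop described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' Spark implementation: a simple centroid-based linear scorer stands in for the SVM, everything runs serially, and ligands are reduced to a single numeric descriptor. The names `dock`, `fit_and_predict`, `cpvs`, `cutoff`, `epsilon` and `min_efficiency` are hypothetical. Efficiency is taken here to mean the fraction of single-label prediction sets, and the final step docks every ligand not confidently predicted as low-scoring; both are assumptions.

```python
import random

def dock(x):
    # Hypothetical stand-in for a real docking run; the toy score is the descriptor itself.
    return x

def p_value(cal_alphas, alpha):
    # Conformal p-value: share of calibration nonconformity scores >= the new one.
    return (sum(a >= alpha for a in cal_alphas) + 1) / (len(cal_alphas) + 1)

def fit_and_predict(train, xs, epsilon):
    # Class-conditional (Mondrian) inductive conformal prediction on (x, label) pairs.
    # Returns one prediction set per x, or None while a class is still missing.
    proper, cal = train[0::2], train[1::2]  # proper training / calibration split
    means = {}
    for c in (0, 1):
        vals = [x for x, y in proper if y == c]
        if not vals or not any(y == c for _, y in cal):
            return None
        means[c] = sum(vals) / len(vals)
    mid = (means[0] + means[1]) / 2
    sign = 1.0 if means[1] >= means[0] else -1.0

    def alpha(x, c):
        # Nonconformity: how strongly the linear decision value contradicts label c.
        decision = sign * (x - mid)  # > 0 leans towards class 1 (high-scoring)
        return -decision if c == 1 else decision

    cal_alphas = {c: [alpha(x, c) for x, y in cal if y == c] for c in (0, 1)}
    return [{c for c in (0, 1) if p_value(cal_alphas[c], alpha(x, c)) > epsilon}
            for x in xs]

def cpvs(library, batch_size=100, cutoff=0.8, epsilon=0.2, min_efficiency=0.8, seed=0):
    rng = random.Random(seed)
    remaining = list(range(len(library)))
    rng.shuffle(remaining)
    docked, excluded, train = {}, set(), []
    while remaining:
        # Dock the next batch and add it to the training set.
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for i in batch:
            docked[i] = dock(library[i])
            train.append((library[i], int(docked[i] >= cutoff)))
        sets = fit_and_predict(train, [library[i] for i in remaining], epsilon)
        if sets is None or not remaining:
            continue
        # Exclude ligands confidently predicted as low-scoring.
        low = {i for i, s in zip(remaining, sets) if s == {0}}
        efficiency = sum(len(s) == 1 for s in sets) / len(sets)
        excluded |= low
        remaining = [i for i in remaining if i not in low]
        if efficiency >= min_efficiency:
            break
    # Assumption: dock whatever the final model did not exclude.
    for i in remaining:
        docked[i] = dock(library[i])
    return docked, excluded

toy_rng = random.Random(1)
toy_library = [toy_rng.random() for _ in range(1000)]
toy_docked, toy_excluded = cpvs(toy_library)
print(len(toy_docked), "docked,", len(toy_excluded), "excluded")
```

The per-class calibration keeps the p-values meaningful when high-scoring ligands are rare, which is the usual situation in screening. In a Spark setting, the docking of each batch and the prediction over the remaining ligands are the natural steps to distribute.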
Pages: 8