Open Source Bayesian Models. 2. Mining a "Big Dataset" To Create and Validate Models with ChEMBL

Cited by: 58
Authors
Clark, Alex M. [1]
Ekins, Sean [2, 3, 4]
Affiliations
[1] Mol Mat Informat Inc, Montreal, PQ H3J 2S1, Canada
[2] Collaborat Pharmaceut Inc, Fuquay Varina, NC 27526 USA
[3] Collaborat Chem, Fuquay Varina, NC 27526 USA
[4] Collaborat Drug Discovery, Burlingame, CA 94010 USA
Keywords
SUPPORT VECTOR MACHINE; ABL TYROSINE KINASE; DRUG DISCOVERY; MYCOBACTERIUM-TUBERCULOSIS; IN-SILICO; SMALL MOLECULES; ANTITUBERCULOSIS AGENTS; QSAR MODELS; TB MOBILE; CHEMOGENOMICS;
DOI
10.1021/acs.jcim.5b00144
Chinese Library Classification number
R914 [Medicinal Chemistry]
Discipline classification code
100701
Abstract
In an associated paper, we described a reference implementation of Laplacian-corrected naive Bayesian model building using extended connectivity (ECFP) and molecular function class (FCFP) fingerprints of maximum diameter 6. As a follow-up, we have now undertaken a large-scale validation study to ensure that the technique generalizes to a broad variety of drug discovery datasets. To achieve this, we used the ChEMBL (version 20) database and split it into more than 2000 separate datasets, each of which consists of compounds and measurements with the same target and activity measurement. To test these datasets with the two-state Bayesian classification, we developed an automated algorithm for detecting a suitable threshold for active/inactive designation, which we applied to all collections. With these datasets, we established that our Bayesian model implementation is effective for the large majority of cases, and we quantified the impact of fingerprint folding on the receiver operating characteristic (ROC) cross-validation metrics. We also studied the impact that the choice of training/testing set partitioning has on the resulting recall rates. The datasets have been made publicly available for download, along with the corresponding model data files, which can be used in conjunction with the CDK and several mobile apps. We have also explored some novel visualization methods that leverage the structural origins of the ECFP/FCFP fingerprints to attribute positive and negative contributions to activity to specific regions of a molecule. The ability to score molecules against thousands of relevant datasets spanning multiple organisms may also help to assess desirable and undesirable off-target effects, and to suggest potential targets for compounds derived from phenotypic screens.
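The two technical ideas named in the abstract, Laplacian-corrected naive Bayesian scoring and fingerprint folding, can be illustrated with a minimal Python sketch. It is not the authors' reference implementation. It assumes the commonly used Laplacian-corrected estimator, in which a feature seen in T molecules, A of them active, contributes ln((A + 1) / (T·P + 1)), where P is the overall fraction of actives. It also assumes folding by bit mask into a power-of-two number of bins. The function names (`fold`, `train`, `score`) and the integer feature hashes are hypothetical stand-ins for real ECFP/FCFP output.

```python
import math
from collections import Counter


def fold(features, size=1024):
    """Fold integer feature hashes into `size` bins (assumed to be a power of two)."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a positive power of two")
    return {f & (size - 1) for f in features}


def train(molecules, labels, size=1024):
    """Build a model mapping each folded bit to its log contribution.

    molecules: list of iterables of integer feature hashes (hypothetical input)
    labels:    list of bools, True meaning active
    """
    total = len(molecules)
    base_rate = sum(labels) / total  # overall fraction of actives, P
    seen = Counter()         # molecules containing each folded bit, T
    seen_active = Counter()  # active molecules containing each folded bit, A
    for feats, active in zip(molecules, labels):
        for bit in fold(feats, size):
            seen[bit] += 1
            if active:
                seen_active[bit] += 1
    # Laplacian-corrected contribution: ln((A + 1) / (T * P + 1))
    return {
        bit: math.log((seen_active[bit] + 1) / (seen[bit] * base_rate + 1))
        for bit in seen
    }


def score(model, features, size=1024):
    """Sum the contributions of a molecule's folded bits; unseen bits add nothing."""
    return sum(model.get(bit, 0.0) for bit in fold(features, size))


# Toy data: features 1, 2, 3 occur only in actives; 4 and 5 only in inactives.
toy_molecules = [[1, 2], [1, 3], [4], [4, 5]]
toy_labels = [True, True, False, False]
toy_model = train(toy_molecules, toy_labels)
```

Scores from such a model are unnormalized sums of log ratios, so they are only meaningful relative to one another or to a calibrated cutoff. A smaller `size` makes unrelated features collide in the same bin, which is the folding effect whose impact on ROC metrics the abstract says was quantified.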
Pages: 1246-1260
Page count: 15