Open Source Bayesian Models. 2. Mining a "Big Dataset" To Create and Validate Models with ChEMBL

Cited by: 58
Authors
Clark, Alex M. [1]
Ekins, Sean [2, 3, 4]
Affiliations
[1] Mol Mat Informat Inc, Montreal, PQ H3J 2S1, Canada
[2] Collaborat Pharmaceut Inc, Fuquay Varina, NC 27526 USA
[3] Collaborat Chem, Fuquay Varina, NC 27526 USA
[4] Collaborat Drug Discovery, Burlingame, CA 94010 USA
Keywords
SUPPORT VECTOR MACHINE; ABL TYROSINE KINASE; DRUG DISCOVERY; MYCOBACTERIUM-TUBERCULOSIS; IN-SILICO; SMALL MOLECULES; ANTITUBERCULOSIS AGENTS; QSAR MODELS; TB MOBILE; CHEMOGENOMICS;
DOI
10.1021/acs.jcim.5b00144
Chinese Library Classification number
R914 [Medicinal Chemistry]
Subject classification code
100701
Abstract
In an associated paper, we described a reference implementation of Laplacian-corrected naive Bayesian model building using extended connectivity (ECFP) and molecular function class (FCFP) fingerprints of maximum diameter 6. As a follow-up, we have now undertaken a large-scale validation study to ensure that the technique generalizes to a broad variety of drug discovery datasets. To achieve this, we used the ChEMBL (version 20) database and split it into more than 2000 separate datasets, each of which consists of compounds and measurements that share the same target and activity measurement type. To test these datasets with two-state Bayesian classification, we developed an automated algorithm for detecting a suitable threshold for active/inactive designation, which we applied to all collections. With these datasets, we were able to establish that our Bayesian model implementation is effective in the large majority of cases, and we were able to quantify the impact of fingerprint folding on the receiver operating characteristic (ROC) cross-validation metrics. We were also able to study the impact that the choice of training/testing set partitioning has on the resulting recall rates. The datasets have been made publicly available for download, along with the corresponding model data files, which can be used in conjunction with the CDK and several mobile apps. We have also explored some novel visualization methods that leverage the structural origins of the ECFP/FCFP fingerprints to identify the regions of a molecule responsible for positive and negative contributions to activity. The ability to score molecules against thousands of relevant datasets spanning multiple organisms may also help to assess desirable and undesirable off-target effects, as well as suggest potential targets for compounds derived from phenotypic screens.
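As an illustration of the technique the abstract names, the following is a minimal sketch of a Laplacian-corrected naive Bayesian classifier over folded fingerprints. It is not the authors' reference implementation. It assumes the commonly described form of the correction, in which each fingerprint bit contributes log((A_F + 1) / (T_F * P + 1)), where T_F is the number of training molecules containing the bit, A_F is the number of those that are active, and P is the overall fraction of actives. The folding rule (masking to the low-order bits of the hash code) and the function names `fold`, `train`, and `score` are assumptions made for this example; fingerprints are represented as plain sets of integer hash codes rather than being computed from structures.

```python
import math
from collections import Counter


def fold(hashes, bits=4096):
    """Fold raw fingerprint hash codes into the range [0, bits).

    Assumption: folding keeps the low-order bits of each hash, so
    `bits` must be a power of two.
    """
    return {h & (bits - 1) for h in hashes}


def train(fingerprints, actives):
    """Build per-bit Laplacian-corrected contributions.

    fingerprints: list of sets of integer bit indices
    actives: list of booleans, one per fingerprint
    Returns a dict mapping bit index -> log contribution.
    """
    total = len(fingerprints)
    base = sum(actives) / total  # overall fraction of actives (P)
    in_total = Counter()   # T_F: molecules containing each bit
    in_active = Counter()  # A_F: active molecules containing each bit
    for fp, is_active in zip(fingerprints, actives):
        for bit in fp:
            in_total[bit] += 1
            if is_active:
                in_active[bit] += 1
    return {
        bit: math.log((in_active[bit] + 1) / (in_total[bit] * base + 1))
        for bit in in_total
    }


def score(model, fp):
    """Sum the contributions of the bits present; unseen bits add nothing."""
    return sum(model.get(bit, 0.0) for bit in fp)


# Toy data: hypothetical hash codes, not real ECFP/FCFP output.
training = [fold(fp) for fp in ({1, 2}, {1, 3}, {4}, {4, 5})]
labels = [True, True, False, False]
model = train(training, labels)
```

A higher score means the molecule shares more bits with the actives than the baseline rate would predict. The raw score is an unscaled sum, so turning it into an active/inactive call requires a separately chosen cutoff.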
Pages: 1246-1260
Page count: 15