GTM-Based QSAR Models and Their Applicability Domains

Cited by: 54
Authors
Gaspar, H. A. [1]
Baskin, I. I. [1,2,3]
Marcou, G. [1]
Horvath, D. [1]
Varnek, A. [1,3]
Affiliations
[1] Univ Strasbourg, Lab Chemoinformat, UMR 7140, F-67000 Strasbourg, France
[2] Moscow MV Lomonosov State Univ, Dept Phys, Moscow 119991, Russia
[3] Kazan Fed Univ, Butlerov Inst Chem, Lab Chemoinformat, Kazan, Russia
Keywords
Generative topographic mapping; QSAR; Dimensionality reduction; Activity landscape; GTM descriptors; Aqueous solubility; Organic compounds; Prediction
DOI
10.1002/minf.201400153
Chinese Library Classification (CLC) number
R914 [Medicinal chemistry]
Discipline classification code
100701
Abstract
In this paper we demonstrate that Generative Topographic Mapping (GTM), a machine learning method traditionally used for data visualisation, can be efficiently applied to QSAR modelling using probability distribution functions (PDF) computed in the latent 2-dimensional space. Several different scenarios of the activity assessment were considered: (i) the "activity landscape" approach based on direct use of PDF, (ii) QSAR models involving GTM-generated descriptors derived from PDF, and (iii) the k-Nearest Neighbours approach in 2D latent space. Benchmarking calculations were performed on five different datasets: stability constants of complexes of the metal cations Ca²⁺, Gd³⁺ and Lu³⁺ with organic ligands in water, aqueous solubility, and activity of thrombin inhibitors. It has been shown that the performance of GTM-based regression models is similar to that obtained with some popular machine-learning methods (random forest, k-NN, M5P regression tree and PLS) and ISIDA fragment descriptors. By comparing GTM activity landscapes built both on predicted and experimental activities, we may visually assess the model's performance and identify the areas in the chemical space corresponding to reliable predictions. The applicability domain used in this work is based on data likelihood. Its application significantly improved model performance for 4 out of 5 datasets.
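Scenario (iii) and the likelihood-based applicability domain can be illustrated with a minimal Python sketch. It assumes the 2D latent coordinates and per-compound log-likelihoods have already been produced by a fitted GTM; the function names, the unweighted neighbour average, and the percentile cut-off are illustrative assumptions, not details given in the abstract.

```python
import numpy as np

def knn_predict(train_xy, train_y, query_xy, k=3):
    """Predict activity as the unweighted mean over the k nearest
    training compounds in the 2D latent space (assumed scheme)."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(query_xy[:, None, :] - train_xy[None, :, :], axis=2)
    # Indices of the k closest training points for each query
    nearest = np.argsort(dists, axis=1)[:, :k]
    return train_y[nearest].mean(axis=1)

def in_domain(query_loglik, train_loglik, percentile=5.0):
    """Flag a query as inside the applicability domain when its
    log-likelihood is at least a low percentile of the training
    log-likelihoods (hypothetical threshold choice)."""
    threshold = np.percentile(train_loglik, percentile)
    return query_loglik >= threshold

# Toy data: latent positions and activities (made up for illustration)
train_xy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])
train_y = np.array([1.0, 2.0, 3.0, 10.0])
query_xy = np.array([[0.1, 0.1]])
pred = knn_predict(train_xy, train_y, query_xy, k=3)

train_loglik = np.array([-1.0, -2.0, -3.0, -4.0, -5.0])
query_loglik = np.array([-2.0, -10.0])
mask = in_domain(query_loglik, train_loglik)
```

Predictions for compounds where `mask` is false would be discarded or reported as unreliable, which is the general idea behind restricting a model to its applicability domain.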
Pages: 348-356
Number of pages: 9