Melting point prediction employing k-nearest neighbor algorithms and genetic parameter optimization

Cited by: 137
Authors
Nigsch, Florian
Bender, Andreas
van Buuren, Bernd
Tissen, Jos
Nigsch, Eduard
Mitchell, John B. O.
Affiliations
[1] Univ Cambridge, Dept Chem, Unilever Ctr Mol Sci Informat, Cambridge CB2 1EW, England
[2] Novartis Inst Biomed Res Inc, Lead Discovery Informat, Cambridge, MA 02139 USA
[3] Unilever R&D Vlaardingen, Food & Hlth Res Inst, NL-3133 AC Vlaardingen, Netherlands
[4] Tech Univ Vienna, Fac Math & Geoinformat, A-1040 Vienna, Austria
DOI
10.1021/ci060149f
Chinese Library Classification (CLC) number
R914 [Medicinal chemistry]
Discipline classification code
100701
Abstract
We have applied the k-nearest neighbor (kNN) modeling technique to the prediction of melting points. A data set of 4119 diverse organic molecules (data set 1) and an additional set of 277 drugs (data set 2) were used to compare performance in different regions of chemical space, and we investigated the influence of the number of nearest neighbors using different types of molecular descriptors. To compute the prediction on the basis of the melting temperatures of the nearest neighbors, we used four different methods (arithmetic and geometric average, inverse distance weighting, and exponential weighting), of which the exponential weighting scheme yielded the best results. We assessed our model via a 25-fold Monte Carlo cross-validation (with approximately 30% of the total data as a test set) and optimized it using a genetic algorithm. Predictions for drugs based on drugs (separate training and test sets each taken from data set 2) were found to be considerably better [root-mean-squared error (RMSE) = 46.3 °C, r² = 0.30] than those based on nondrugs (prediction of data set 2 based on the training set from data set 1, RMSE = 50.3 °C, r² = 0.20). The optimized model yields an average RMSE as low as 46.2 °C (r² = 0.49) for data set 1, and an average RMSE of 42.2 °C (r² = 0.42) for data set 2. It is shown that the kNN method inherently introduces a systematic error in melting point prediction. Much of the remaining error can be attributed to the lack of information about interactions in the liquid state, which are not well-captured by molecular descriptors.
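The abstract names four ways of combining the melting points of the k nearest neighbors into one prediction, evaluated by a 25-fold Monte Carlo cross-validation with roughly 30% of the data held out. The following is a minimal Python sketch of those ideas, not the authors' implementation. The weight formulas (1/d for inverse distance, exp(-d) for exponential weighting), the Euclidean distance on descriptor vectors, temperatures in kelvin, and all function names are illustrative assumptions that the abstract does not specify. The genetic-algorithm optimization is omitted.

```python
import math
import random


def combine(temps, dists, method="exponential"):
    """Combine neighbor melting points (assumed to be in kelvin) into one value."""
    if method == "arithmetic":
        return sum(temps) / len(temps)
    if method == "geometric":
        # Geometric mean via logs; needs strictly positive values, hence kelvin.
        return math.exp(sum(math.log(t) for t in temps) / len(temps))
    if method == "inverse":
        # Assumed weight 1/d; the floor avoids division by zero for exact matches.
        weights = [1.0 / max(d, 1e-12) for d in dists]
    elif method == "exponential":
        # Assumed weight exp(-d); closer neighbors count more.
        weights = [math.exp(-d) for d in dists]
    else:
        raise ValueError(f"unknown method: {method}")
    return sum(w * t for w, t in zip(weights, temps)) / sum(weights)


def knn_predict(query, train_x, train_y, k=3, method="exponential"):
    """Predict a melting point from the k nearest training descriptors."""
    # Euclidean distance is an assumption; any descriptor distance could be used.
    nearest = sorted((math.dist(query, x), y) for x, y in zip(train_x, train_y))[:k]
    return combine([y for _, y in nearest], [d for d, _ in nearest], method)


def monte_carlo_cv(xs, ys, k=3, method="exponential",
                   runs=25, test_frac=0.3, seed=0):
    """Average RMSE over repeated random train/test splits."""
    rng = random.Random(seed)
    n_test = max(1, round(test_frac * len(xs)))
    rmses = []
    for _ in range(runs):
        idx = list(range(len(xs)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        sq_err = [(knn_predict(xs[i], train_x, train_y, k, method) - ys[i]) ** 2
                  for i in test]
        rmses.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(rmses) / len(rmses)
```

Calling `monte_carlo_cv(xs, ys, k=k, method=m)` for each of the four methods and several values of `k` would mirror the kind of comparison the abstract describes, with `xs` as descriptor tuples and `ys` as melting points in kelvin.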
Pages: 2412-2422
Page count: 11