On selection of training and test sets for the development of predictive QSAR models

Cited by: 215

Authors:

Leonard, JT [1]

Roy, K [1]

Affiliations:

[1] Jadavpur Univ, Dept Pharmaceut Technol, Drug Theoret & Cheminformat Lab, Div Med & Pharmaceut Chem, Kolkata 700032, W Bengal, India

Source:

QSAR & COMBINATORIAL SCIENCE | 2006 / Volume 25 / Issue 03

Keywords:

QSAR; HIV protease; CCR5; antagonists; mannitol; piperidinyl amides; ureas; propylamine; validation; K-Means clusters;

DOI:

10.1002/qsar.200510161

Chinese Library Classification (CLC) number:

R914 [Medicinal Chemistry];

Subject classification code:

100701 ;

Abstract:

The development of predictive QSAR models depends not only on the statistical method but also on the algorithm used for the selection of training and test sets. Here, we describe the validation of QSAR models for three data sets of different sizes (n = 35, 56 and 87) based on random division, sorted biological activity data and K-means clusters for the factor scores of the original variable matrix along with/without biological activity values. When the training and test sets were generated by random division or by the activity-range algorithm, predictive models were not obtained in most of the cases. In the case of random division of the data sets into training and test sets, there is no correlation between internal and external validation statistics. However, good external validation statistics were obtained when training and test sets were selected based on K-means clusters of factor scores of the descriptor space along with/without the biological activity values. So, the selection of training and test sets should be based on the proximity of the representative points of the test set to representative points of the training set in the multidimensional descriptor space. The concept of closeness is based on the general assumption underlying all QSAR theories: similar compounds have similar activities. Thus, if one wishes to validate a QSAR model, the points of the test set must be close to the points of the training set in the multidimensional descriptor space. Based on the results of several methods for the division of the training and test sets, we propose that K-means-cluster-based division of training and prediction sets can be used as a reliable method of division of a data set into training and test sets for developing predictive QSAR models.
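
The cluster-based division described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the factor scores have already been computed elsewhere and are passed in as one row per compound, and the function names (`kmeans`, `cluster_based_split`), the 25% test fraction, and the rule of keeping at least one compound per cluster in the training set are all hypothetical choices made for illustration.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    # Assumption: initial centers are k randomly chosen data points.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for it in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if it > 0 and new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_based_split(scores, k=3, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), drawing test compounds from each cluster."""
    labels = kmeans(scores, k, seed=seed)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        rng.shuffle(members)
        n_test = round(test_fraction * len(members))
        # Assumption: every non-empty cluster keeps at least one training compound,
        # so each test compound has a training neighbour in its own cluster.
        n_test = min(n_test, max(len(members) - 1, 0))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


if __name__ == "__main__":
    # Synthetic two-dimensional "factor scores" for 36 hypothetical compounds.
    rng = random.Random(1)
    scores = [[cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1)]
              for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(12)]
    train, test = cluster_based_split(scores, k=3)
    print("training compounds:", train)
    print("test compounds:", test)
```

Because the test compounds are drawn from the same clusters as the training compounds, each test point lies near training points in the descriptor space, which is the closeness condition the abstract argues for. To cluster "along with" the biological activity values, one could append the activity as an extra column of `scores` before calling `cluster_based_split`; how that column should be scaled relative to the factor scores is left open here.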


Pages: 235-251

Page count: 17

