On some aspects of variable selection for partial least squares regression models

Cited by: 717
Authors
Roy, Partha Pratim [1]
Roy, Kunal [1]
Affiliations
[1] Jadavpur Univ, Drug Theoret & Cheminformat Lab, Div Med & Pharmaceut Chem, Dept Pharmaceut Technol, Fac Engn & Technol, Kolkata 700032, India
Source
QSAR & COMBINATORIAL SCIENCE | 2008 / Vol. 27 / Issue 03
Keywords
cross-validated R²; PLS; predictive R²; QSAR; validation; training and test sets
DOI
10.1002/qsar.200710043
Chinese Library Classification (CLC) number
R914 [Medicinal Chemistry]
Discipline classification code
100701
Abstract
This paper explores the optimum variable selection strategy for Partial Least Squares (PLS) regression, using a model dataset of cytoprotection data. The compounds were classified by K-means clustering applied to the standardized descriptor matrix, and ten combinations of training and test sets were generated from the resulting clusters. For each training set, PLS models were developed with the number of components optimized by leave-one-out Q², and the models were then validated externally using the test set compounds. For each set, a PLS model was first constructed using all descriptors (variables). The variables with the smallest standardized regression coefficients were deleted, and the next model was developed with the reduced set of variables. These steps were repeated until further reduction in the number of variables no longer improved Q². In each case, statistical parameters such as the predictive R² (R²pred), the squared correlation coefficient between observed and predicted values with (r²) and without (r₀²) intercept, and the Root Mean Square Error of Prediction (RMSEP) were calculated from the test set compounds. For all ten sets, Q² values increase steadily as variables are deleted, whereas R²pred values show no specific trend. In no case do the highest Q² and the highest R²pred appear in the same trial, i.e., with the same combination of variables. This suggests that, from the viewpoint of external predictability, choosing variables for PLS on the basis of Q² may not be optimal. Moreover, a clear separation of the r² and r₀² curves in some sets suggests that such models may not be truly predictive despite acceptable R²pred values. Another observation is that the coefficient of determination R² for the training set is less sensitive to the deletion of variables than validation parameters such as Q² and R²pred.
Finally, a new parameter, r²m, is suggested as an indicator of the external predictability of QSAR models.
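The external validation parameters named in the abstract can be illustrated with a minimal sketch. The abstract itself gives no formulas, so the definitions below are assumptions taken from common usage in the QSAR literature: R²pred is computed against the training set mean, r₀² comes from a regression of observed on predicted values forced through the origin (some authors swap the axes), and r²m is taken as r²(1 − √|r² − r₀²|). All function names are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def r2_pred(y_obs, y_pred, y_train_mean):
    """Predictive R^2 for a test set, referenced to the training set mean (assumed definition)."""
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_ref = sum((o - y_train_mean) ** 2 for o in y_obs)
    return 1.0 - press / ss_ref


def rmsep(y_obs, y_pred):
    """Root mean square error of prediction over the test set."""
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    return math.sqrt(press / len(y_obs))


def r2(y_obs, y_pred):
    """Squared Pearson correlation between observed and predicted values (with intercept)."""
    mo, mp = _mean(y_obs), _mean(y_pred)
    s_op = sum((o - mo) * (p - mp) for o, p in zip(y_obs, y_pred))
    s_oo = sum((o - mo) ** 2 for o in y_obs)
    s_pp = sum((p - mp) ** 2 for p in y_pred)
    return s_op ** 2 / (s_oo * s_pp)


def r0_2(y_obs, y_pred):
    """Squared correlation without intercept: fit y_obs = k * y_pred through the origin (assumed convention)."""
    k = sum(o * p for o, p in zip(y_obs, y_pred)) / sum(p * p for p in y_pred)
    mo = _mean(y_obs)
    ss_res = sum((o - k * p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mo) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def rm2(y_obs, y_pred):
    """r_m^2 = r^2 * (1 - sqrt(|r^2 - r0^2|)); penalizes a gap between r^2 and r0^2 (assumed formula)."""
    a, b = r2(y_obs, y_pred), r0_2(y_obs, y_pred)
    return a * (1.0 - math.sqrt(abs(a - b)))


# Hypothetical test set values, for illustration only (not data from the paper).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]  # perfectly correlated but offset by a constant
print(r2(observed, predicted), r0_2(observed, predicted), rm2(observed, predicted))
```

In the illustrative data the predictions are perfectly correlated with the observations but carry a constant offset, so r² stays high while r₀² drops. That is the kind of separation between the r² and r₀² curves the abstract describes, and under the assumed formula it lowers r²m.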
Pages: 302-313
Page count: 12