Boosting: An ensemble learning tool for compound classification and QSAR modeling

Cited by: 161
Authors
Svetnik, V
Wang, T
Tong, C
Liaw, A
Sheridan, RP
Song, QH
Affiliations
[1] Merck Res Labs, Biometr Res & Mol Syst, Rahway, NJ 07065 USA
[2] Univ Wisconsin, Dept Stat, Madison, WI 53706 USA
DOI
10.1021/ci0500379
Chinese Library Classification (CLC) number
R914 [Medicinal chemistry]
Discipline classification code
100701
Abstract
A classification and regression tool, J. H. Friedman's Stochastic Gradient Boosting (SGB), is applied to predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Stochastic Gradient Boosting is a procedure for building a sequence of models, for instance regression trees (as in this paper), whose outputs are combined to form a predicted quantity, either an estimate of the biological activity, or a class label to which a molecule belongs. In particular, the SGB procedure builds a model in a stage-wise manner by fitting each tree to the gradient of a loss function: e.g., squared error for regression and binomial log-likelihood for classification. The values of the gradient are computed for each sample in the training set, but only a random sample of these gradients is used at each stage. (Friedman showed that the well-known boosting algorithm, AdaBoost of Freund and Schapire, could be considered as a particular case of SGB.) The SGB method is used to analyze 10 cheminformatics data sets, most of which are publicly available. The results show that SGB's performance is comparable to that of Random Forest, another ensemble learning method, and are generally competitive with or superior to those of other QSAR methods. The use of SGB's variable importance with partial dependence plots for model interpretation is also illustrated.
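The stage-wise procedure described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses squared-error loss (whose negative gradient is simply the residual), one-split trees ("stumps") instead of full regression trees, a single descriptor, and a synthetic sine-curve data set. All function names and parameter values (`fit_stump`, `fit_sgb`, `n_stages`, `learning_rate`, `subsample`) are hypothetical choices made for this example.

```python
import math
import random


def fit_stump(xs, targets):
    """Least-squares fit of a one-split regression tree to (x, target) pairs."""
    pairs = sorted(zip(xs, targets))
    n = len(pairs)
    total = sum(t for _, t in pairs)
    best = None  # (score, threshold, left_mean, right_mean)
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        # Maximizing this score is equivalent to minimizing the squared error.
        score = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if best is None or score > best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (score, threshold, left_sum / i, right_sum / (n - i))
    if best is None:  # all x identical: fall back to a constant prediction
        mean = total / n
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_sgb(xs, ys, n_stages=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting for squared-error loss (illustrative only)."""
    rng = random.Random(seed)
    f0 = sum(ys) / len(ys)              # initial constant model
    preds = [f0] * len(ys)
    stumps = []
    k = min(len(xs), max(2, int(subsample * len(xs))))
    for _ in range(n_stages):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is y - F,
        # computed for every training sample.
        neg_grads = [y - p for y, p in zip(ys, preds)]
        # Only a random subsample of those values is used to fit this stage.
        idx = rng.sample(range(len(xs)), k)
        stump = fit_stump([xs[i] for i in idx], [neg_grads[i] for i in idx])
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return f0, learning_rate, stumps


def predict_sgb(model, x):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Synthetic toy data (an assumption for illustration, not a QSAR data set).
xs = [i / 10 for i in range(60)]
ys = [math.sin(x) for x in xs]

model = fit_sgb(xs, ys)
mean_y = sum(ys) / len(ys)
mse_baseline = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_model = sum((y - predict_sgb(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Comparing `mse_model` with `mse_baseline` shows how much the boosted sequence improves on a constant prediction for the training data. For classification, the same loop would presumably fit each stage to the gradient of the binomial log-likelihood instead of the residual.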
Pages: 786-799
Page count: 14
Related papers
55 in total
[21] Freund, Y; Schapire, RE. A decision-theoretic generalization of on-line learning and an application to boosting [J]. JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1997, 55 (01): 119-139
[22] Friedman J., 2001, ELEMENTS STAT LEARNI, V1
[23] Friedman J.H., Importance sampled learning ensembles
[24] Friedman, JH. Greedy function approximation: A gradient boosting machine [J]. ANNALS OF STATISTICS, 2001, 29 (05): 1189-1232
[25] Friedman, JH. Stochastic gradient boosting [J]. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2002, 38 (04): 367-378
[26] Fukunaga K., 1990, INTRO STAT PATTERN R
[27] Gilligan, PJ; Cain, GA; Christos, TE; Cook, L; Drummond, S; Johnson, AL; Kergaye, AA; McElroy, JF; Rohrbach, KW; Schmidt, WK; Tam, SW. Novel piperidine sigma receptor ligands as potential antipsychotic-drugs [J]. JOURNAL OF MEDICINAL CHEMISTRY, 1992, 35 (23): 4344-4361
[28] Hawkins DM, 1998, COMP SCI STAT, V30, P534
[29] He, P; Xu, CJ; Liang, YZ; Fang, KT. Improving the classification accuracy in chemistry via boosting technique [J]. CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2004, 70 (01): 39-46
[30] Ho TK, 1998, IEEE T PATTERN ANAL, V20, P832, DOI 10.1109/34.709601