Boosting: An ensemble learning tool for compound classification and QSAR modeling

Cited by: 161
Authors
Svetnik, V
Wang, T
Tong, C
Liaw, A
Sheridan, RP
Song, QH
Affiliations
[1] Merck Res Labs, Biometr Res & Mol Syst, Rahway, NJ 07065 USA
[2] Univ Wisconsin, Dept Stat, Madison, WI 53706 USA
DOI
10.1021/ci0500379
Chinese Library Classification
R914 [Medicinal Chemistry]
Discipline classification code
100701
Abstract
A classification and regression tool, J. H. Friedman's Stochastic Gradient Boosting (SGB), is applied to predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Stochastic Gradient Boosting is a procedure for building a sequence of models, for instance regression trees (as in this paper), whose outputs are combined to form a predicted quantity, either an estimate of the biological activity or a class label to which a molecule belongs. In particular, the SGB procedure builds a model in a stage-wise manner by fitting each tree to the gradient of a loss function, e.g., squared error for regression and binomial log-likelihood for classification. The values of the gradient are computed for each sample in the training set, but only a random sample of these gradients is used at each stage. (Friedman showed that the well-known boosting algorithm, AdaBoost of Freund and Schapire, could be considered as a particular case of SGB.) The SGB method is used to analyze 10 cheminformatics data sets, most of which are publicly available. The results show that SGB's performance is comparable to that of Random Forest, another ensemble learning method, and is generally competitive with or superior to that of other QSAR methods. The use of SGB's variable importance with partial dependence plots for model interpretation is also illustrated.
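The stage-wise procedure described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a single numeric descriptor, squared-error loss (whose negative gradient is the plain residual), and one-split regression trees ("stumps") as base learners. The names `fit_stump`, `sgb_fit`, `sgb_predict` and the default values for `n_stages`, `learning_rate`, and `subsample` are all hypothetical choices made for the example.

```python
import random


def fit_stump(xs, targets):
    """Fit a one-split regression tree to `targets` on a single feature.

    Returns (threshold, left_value, right_value).
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # cannot split between identical feature values
        left = [targets[i] for i in order[:k]]
        right = [targets[i] for i in order[k:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, left_mean, right_mean)
    if best is None:
        # No valid split: fall back to a constant prediction.
        mean = sum(targets) / len(targets)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def sgb_fit(xs, ys, n_stages=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Stage-wise boosting for squared error with random subsampling."""
    rng = random.Random(seed)
    f0 = sum(ys) / len(ys)            # initial constant model
    preds = [f0] * len(ys)
    stumps = []
    m = min(len(xs), max(2, int(subsample * len(xs))))
    for _ in range(n_stages):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F,
        # computed for every training sample.
        grads = [y - p for y, p in zip(ys, preds)]
        # Only a random subset of the gradients is used to fit this stage.
        idx = rng.sample(range(len(xs)), m)
        stump = fit_stump([xs[i] for i in idx], [grads[i] for i in idx])
        stumps.append(stump)
        # Shrunken update applied to all training samples.
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return f0, learning_rate, stumps


def sgb_predict(model, x):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(predict_stump(s, x) for s in stumps)


if __name__ == "__main__":
    # Toy data: activity jumps from 1.0 to 3.0 at descriptor value 2.0.
    xs = [i / 10 for i in range(40)]
    ys = [1.0 if x < 2.0 else 3.0 for x in xs]
    model = sgb_fit(xs, ys)
    print(sgb_predict(model, 0.5), sgb_predict(model, 3.5))
```

For classification, the same loop would fit each stage to the gradient of the binomial log-likelihood instead of the residual. In practice, multi-descriptor data calls for full regression trees from an established library.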
Pages: 786-799
Page count: 14