Extreme Gradient Boosting as a Method for Quantitative Structure-Activity Relationships

Cited by: 378
Authors
Sheridan, Robert P. [1 ]
Wang, Wei Min [2 ]
Liaw, Andy [3 ]
Ma, Junshui [3 ]
Gifford, Eric M. [4 ]
Affiliations
[1] Merck & Co Inc, Modeling & Informat Dept, 126 E Lincoln Ave, Rahway, NJ 07065 USA
[2] MSD Int GmbH, Singapore Branch, Data Sci Dept, 1 Fusionopolis Pl,06-10-07-18 Galaxis, Singapore 138522, Singapore
[3] Merck & Co Inc, Biometr Res Dept, 126 E Lincoln Ave, Rahway, NJ 07065 USA
[4] MSD Int GmbH, Bioinformat Dept, Singapore Branch, 1 Fusionopolis Pl,06-10-07-18 Galaxis, Singapore 138522, Singapore
Keywords
COMPOUND CLASSIFICATION; RANDOM FOREST; CLASSIFIERS; TOOL;
DOI
10.1021/acs.jcim.6b00591
Chinese Library Classification
R914 [Medicinal Chemistry]
Subject Classification Code
100701
Abstract
In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that can generate the most accurate predictions but that are not overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions, on the average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.
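As a concrete illustration of the workflow the abstract describes, the minimal sketch below fits an XGBoost regression model to a synthetic molecule-by-descriptor matrix on a single CPU using the xgboost Python package. The parameter values shown are illustrative placeholders only, not the standard parameter set recommended in the paper, and the random data stand in for an in-house QSAR data set.

# Minimal sketch: XGBoost regression on a descriptor matrix (single CPU).
# Parameter values are illustrative, not the paper's recommended settings.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a QSAR data set: rows = molecules,
# columns = descriptors, y = measured activities.
rng = np.random.default_rng(0)
X = rng.random((1000, 200))
y = rng.random(1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=1000,      # number of boosted trees
    max_depth=6,            # depth of each tree
    learning_rate=0.05,     # shrinkage (eta)
    subsample=0.8,          # fraction of molecules sampled per tree
    colsample_bytree=0.8,   # fraction of descriptors sampled per tree
    n_jobs=1,               # single CPU, as in the paper's timing comparison
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)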
Pages: 2353-2360
Number of pages: 8