Greedy and Linear Ensembles of Machine Learning Methods Outperform Single Approaches for QSPR Regression Problems

Cited by: 12
Authors
Kew, William [1, 2]
Mitchell, John B. O. [1, 2]
Affiliations
[1] Univ St Andrews, Biomed Sci Res Complex, St Andrews KY16 9ST, Fife, Scotland
[2] Univ St Andrews, EaStCHEM Sch Chem, St Andrews KY16 9ST, Fife, Scotland
Keywords
Machine learning; Quantitative structure-property relationships; Greedy ensembles; Linear ensembles; Support vector machine; Melting point; Aqueous solubility; Drug discovery; Prediction; QSAR; Classification; Quality; Lipophilicity; Algorithms
DOI
10.1002/minf.201400122
Chinese Library Classification (CLC) number
R914 [Medicinal chemistry]
Discipline classification code
100701
Abstract
The application of Machine Learning to cheminformatics is a large and active field of research, but there exist few papers which discuss whether ensembles of different Machine Learning methods can improve upon the performance of their component methodologies. Here we investigated a variety of methods, including kernel-based, tree, linear, neural networks, and both greedy and linear ensemble methods. These were all tested against a standardised methodology for regression with data relevant to the pharmaceutical development process. This investigation focused on QSPR problems within drug-like chemical space. We aimed to investigate which methods perform best, and how the 'wisdom of crowds' principle can be applied to ensemble predictors. It was found that no single method performs best for all problems, but that a dynamic, well-structured ensemble predictor would perform very well across the board, usually providing an improvement in performance over the best single method. Its use of weighting factors allows the greedy ensemble to acquire a bigger contribution from the better performing models, and this helps the greedy ensemble generally to outperform the simpler linear ensemble. Choice of data preprocessing methodology was found to be crucial to performance of each method too.
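The weighting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a generic greedy forward selection with replacement, in which the number of times a model is picked becomes its weight, and it compares the result against a plain equal-weight average used here only as a baseline. The function names, the synthetic data, and the noise levels are all hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between two 1-D arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def greedy_ensemble_weights(preds, y, n_iter=50):
    """Greedy forward selection with replacement (illustrative assumption).

    preds : array of shape (n_models, n_samples), one row per model
    y     : array of shape (n_samples,), observed values
    Returns non-negative weights that sum to 1; a model's weight is the
    fraction of iterations in which it was selected.
    """
    preds = np.asarray(preds, dtype=float)
    y = np.asarray(y, dtype=float)
    counts = np.zeros(preds.shape[0], dtype=int)
    running_sum = np.zeros_like(y)
    for step in range(1, n_iter + 1):
        # Ensemble mean that would result from adding each model once more.
        candidates = (running_sum + preds) / step
        errors = np.sqrt(np.mean((candidates - y) ** 2, axis=1))
        best = int(np.argmin(errors))
        counts[best] += 1
        running_sum += preds[best]
    return counts / counts.sum()


# Synthetic stand-in for three models of differing quality (hypothetical data).
rng = np.random.default_rng(0)
y = rng.normal(size=200)
noise_levels = [0.3, 0.5, 1.5]
preds = np.stack([y + rng.normal(scale=s, size=y.size) for s in noise_levels])

weights = greedy_ensemble_weights(preds, y)
greedy_pred = weights @ preds          # weighted combination
mean_pred = preds.mean(axis=0)         # equal-weight baseline

single_rmse = [rmse(y, p) for p in preds]
print("weights:", weights)
print("best single RMSE:", min(single_rmse))
print("equal-weight RMSE:", rmse(y, mean_pred))
print("greedy RMSE:", rmse(y, greedy_pred))
```

In this sketch the weights are fitted and scored on the same data for brevity. In practice they would be fitted on held-out or cross-validated predictions, otherwise the comparison flatters the ensemble.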
Pages: 634-647
Page count: 14