Development of Ensemble Learning Method Considering Applicability Domains

Cited by: 0
Authors
Sato, Keigo [1]
Kaneko, Hiromasa [1]
Affiliations
[1] Meiji Univ, Sch Sci & Technol, Dept Appl Chem, Tokyo, Japan
Keywords
Ensemble learning; Regression; Applicability domain; QSAR; QSPR; Models; Prediction
DOI
10.2477/jccj.2019-0010
Chinese Library Classification (CLC) number
O6 [Chemistry]
Subject classification code
0703
Abstract
In quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) analyses, regression models are constructed between the activities and properties y of compounds and their molecular descriptors x. In ensemble learning, multiple sub-models are constructed to improve predictive performance, and the final y-value is predicted by integrating the y-values predicted by the sub-models. Predictive performance has been confirmed to improve when the applicability domain (AD) of each sub-model is considered and only the sub-models whose AD contains the sample are used. However, ADs cannot be compared between sub-datasets with different x, so it has been impossible to predict a y-value for a new sample by selecting and weighting sub-models. In this study, we focused on the similarity-weighted root-mean-square distance (wRMSD), an index of AD, and developed wRMSD-based AD considering ensemble learning (WEL), an ensemble learning method based on wRMSD. Because wRMSD is expressed on the scale of y, ADs can be compared between sub-models with different x. A y-value can therefore be predicted for a new sample by giving greater weight to sub-models with low wRMSD values, which indicate high prediction reliability. Data analysis using three datasets of compounds with measured water solubility, toxicity, and pharmacological activity confirmed that WEL enlarged the AD and improved predictive performance compared with the conventional ensemble learning method. Python code for WEL is available at https://github.com/hkaneko1985/wel.
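The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository above for that) and rests on several assumptions that the abstract does not state: similarity is a Gaussian kernel on descriptor distance, wRMSD is the similarity-weighted root mean square of training residuals, sub-models are plain least-squares fits, and sub-model weights are proportional to 1 / wRMSD. All function and variable names are hypothetical.

```python
import numpy as np

def wrmsd(X_train, residuals, x_query, gamma=1.0):
    """Similarity-weighted RMS of residuals, expressed in the units of y.

    Assumption: similarity is a Gaussian kernel on squared Euclidean distance.
    """
    d2 = np.sum((X_train - x_query) ** 2, axis=1)
    s = np.exp(-gamma * d2)
    denom = max(np.sum(s), np.finfo(float).tiny)  # guard against underflow
    return float(np.sqrt(np.sum(s * residuals ** 2) / denom))

def fit_sub_model(X, y, cols):
    """Least-squares sub-model on a subset of descriptor columns.

    Simplification: training residuals are stored; cross-validated
    residuals would be a more cautious choice.
    """
    A = np.column_stack([X[:, cols], np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return {
        "cols": cols,
        "X": X[:, cols],
        "residuals": y - A @ coef,
        "predict": lambda xq, c=coef: float(np.append(xq, 1.0) @ c),
    }

def ensemble_predict(sub_models, x_query, eps=1e-12):
    """Weighted average of sub-model predictions with weights 1 / wRMSD (assumed rule)."""
    preds, scores = [], []
    for m in sub_models:
        xq = x_query[m["cols"]]  # each sub-model sees only its own descriptors
        preds.append(m["predict"](xq))
        scores.append(wrmsd(m["X"], m["residuals"], xq))
    preds, scores = np.asarray(preds), np.asarray(scores)
    w = 1.0 / (scores + eps)
    return float(np.dot(w, preds) / w.sum()), preds, scores

# Synthetic data: y depends on descriptors 0-2 but not on descriptor 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=60)

sub_models = [fit_sub_model(X, y, [0, 1, 2]), fit_sub_model(X, y, [3])]
x_new = np.array([0.2, -0.1, 0.3, 0.0])
y_hat, preds, scores = ensemble_predict(sub_models, x_new)
print("sub-model predictions:", preds)
print("wRMSD per sub-model:", scores)
print("ensemble prediction:", y_hat)
```

Because each wRMSD value is in the units of y, the two scores are directly comparable even though the sub-models use different descriptor sets, which is the property the abstract relies on.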
Pages: 187-193
Number of pages: 7