Improved Chemical Structure-Activity Modeling Through Data Augmentation

Cited by: 33
Authors
Cortes-Ciriano, Isidro [1, 2]
Bender, Andreas [3]
Affiliations
[1] Inst Pasteur, Dept Biol Struct & Chim, Unite Bioinformat Struct, F-75015 Paris, France
[2] CNRS, UMR 3825, F-75015 Paris, France
[3] Univ Cambridge, Dept Chem, Ctr Mol Sci Informat, Cambridge CB2 1EW, England
Keywords
CLASSICAL LEAST-SQUARES; DOMAIN APPLICABILITY; QSAR MODELS; PREDICTION; ALGORITHMS; DIVERSE;
DOI
10.1021/acs.jcim.5b00570
Chinese Library Classification
R914 [Medicinal Chemistry];
Discipline classification code
100701;
Abstract
Extending the original training data with simulated unobserved data points has proven powerful for increasing both the generalization ability of predictive models and their robustness against changes in the structure of data (e.g., systematic drifts in the response variable) in diverse areas such as the analysis of spectroscopic data or the detection of conserved domains in protein sequences. In this contribution, we explore the effect of data augmentation on the predictive power of QSAR models, quantified by the RMSE values on the test set. We collected 8 diverse data sets from the literature and ChEMBL version 19 reporting compound activity as pIC50 values. The original training data were replicated (i.e., augmented) N times (N ∈ {0, 1, 2, 4, 6, 8, 10}), and these replications were perturbed with Gaussian noise (μ = 0, σ = σ_noise) on either (i) the pIC50 values, (ii) the compound descriptors, (iii) both the compound descriptors and the pIC50 values, or (iv) none of them. The effect of data augmentation was evaluated across three different algorithms (RF, GBM, and SVM radial) and two descriptor types (Morgan fingerprints and physicochemical-property-based descriptors). The influence of all factor levels was analyzed with a balanced fixed-effect full-factorial experiment. Overall, data augmentation consistently led to increased predictive power on the test set by 10-15%. Injecting noise on (i) compound descriptors or on (ii) both compound descriptors and pIC50 values led to the largest drop in RMSE_test values (from 0.67-0.72 to 0.60-0.63 pIC50 units). The maximum increase in predictive power provided by data augmentation is reached when the training data are replicated one time.
Therefore, extending the original training data with one perturbed repetition thereof represents a reasonable trade-off between the increased performance of the models and the computational cost of data augmentation, namely the increase in (i) model complexity due to the need for optimizing σ_noise and (ii) the number of training examples.
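The replication-and-perturbation scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `augment`, its parameters, and the use of a single noise standard deviation for both descriptors and response values are assumptions made here for illustration, and the toy data are invented.

```python
import random


def augment(X, y, n_copies=1, sigma_noise=0.1, noise_on="both", seed=0):
    """Append n_copies Gaussian-perturbed replicas of (X, y) to the originals.

    Hypothetical helper, not taken from the paper.
    noise_on selects what is perturbed in each replica:
    "response", "descriptors", "both", or "none".
    """
    allowed = {"response", "descriptors", "both", "none"}
    if noise_on not in allowed:
        raise ValueError(f"noise_on must be one of {sorted(allowed)}")

    rng = random.Random(seed)
    perturb_x = noise_on in {"descriptors", "both"}
    perturb_y = noise_on in {"response", "both"}

    # The original, unperturbed training data are always kept.
    X_aug = [list(row) for row in X]
    y_aug = list(y)

    for _ in range(n_copies):
        for row, target in zip(X, y):
            # Zero-mean Gaussian noise with standard deviation sigma_noise
            # (assumption: the same sigma for descriptors and response).
            new_row = [
                v + rng.gauss(0.0, sigma_noise) if perturb_x else v
                for v in row
            ]
            new_target = (
                target + rng.gauss(0.0, sigma_noise) if perturb_y else target
            )
            X_aug.append(new_row)
            y_aug.append(new_target)
    return X_aug, y_aug


# Toy example: two compounds, two descriptors each, with pIC50 responses.
X_train = [[0.0, 1.0], [2.0, 3.0]]
y_train = [5.0, 6.5]
X_aug, y_aug = augment(X_train, y_train, n_copies=1, noise_on="descriptors")
print(len(X_aug), len(y_aug))
```

With `n_copies=1` the training set doubles in size, which corresponds to the single perturbed repetition the abstract identifies as a reasonable trade-off. In practice `sigma_noise` would be treated as a hyperparameter to tune, as the abstract notes.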
Pages: 2682-2692
Number of pages: 11