Modelling methods and cross-validation variants in QSAR: a multi-level analysis

Cited by: 40
Authors
Racz, A. [1]
Bajusz, D. [2]
Heberger, K. [1]
Affiliations
[1] Hungarian Acad Sci, Res Ctr Nat Sci, Plasma Chem Res Grp, Budapest, Hungary
[2] Hungarian Acad Sci, Res Ctr Nat Sci, Med Chem Res Grp, Budapest, Hungary
Keywords
QSAR; toxicity; validation; SRD; cross-validation; MLR; PLS; PCR; SVM; ANN; BENZENE-DERIVATIVES; BIOLOGICAL-ACTIVITY; RANKING DIFFERENCES; PREDICTION; PARAMETERS; TOXICITY; PRINCIPLES; INHIBITORS; REGRESSION; 3D-QSAR;
DOI
10.1080/1062936X.2018.1505778
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703 ;
Abstract
Prediction performance often depends on the cross- and test validation protocols applied. Several combinations of different cross-validation variants and model-building techniques were used to reveal their complexity. Two case studies (acute toxicity data) were examined, applying five-fold cross-validation (with random, contiguous and Venetian blind forms) and leave-one-out cross-validation (CV). External test sets showed the effects and differences between the validation protocols. The models were generated with multiple linear regression (MLR), principal component regression (PCR), partial least squares (PLS) regression, artificial neural networks (ANN) and support vector machines (SVM). The comparisons were made by the sum of ranking differences (SRD) and factorial analysis of variance (ANOVA). The largest bias and variance could be assigned to the MLR method and contiguous block cross-validation. SRD can provide a unique and unambiguous ranking of methods and CV variants. Venetian blind cross-validation is a promising tool. The generated models were also compared based on their basic performance parameters (r² and Q²). MLR produced the largest gap between r² and Q², while PCR gave the smallest. Although PCR is the best validated and balanced technique, SVM always outperformed the other methods when experimental values were the benchmark. Variable selection was advantageous, and the modelling had a larger influence than CV variants.
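The cross-validation variants named in the abstract differ only in how samples are assigned to folds. The following is a minimal sketch of that assignment, not the authors' implementation: the function name `fold_labels`, its parameters, and the fixed random seed are assumptions introduced for illustration, and it assumes the samples are already in the order in which they appear in the data table.

```python
import random


def fold_labels(n_samples, n_folds=5, variant="venetian", seed=0):
    """Return a list giving the fold index of each sample (hypothetical helper).

    variant:
      "contiguous" - consecutive blocks of samples form each fold
      "venetian"   - every n_folds-th sample goes to the same fold
      "random"     - balanced folds, assigned in shuffled order
      "loo"        - leave-one-out: each sample is its own fold
    """
    if variant == "contiguous":
        # Integer arithmetic splits the index range into n_folds blocks.
        return [i * n_folds // n_samples for i in range(n_samples)]
    if variant == "venetian":
        # Interleaved assignment: 0, 1, 2, ..., n_folds - 1, 0, 1, ...
        return [i % n_folds for i in range(n_samples)]
    if variant == "random":
        # Start from the balanced interleaved labels, then shuffle them.
        labels = [i % n_folds for i in range(n_samples)]
        random.Random(seed).shuffle(labels)
        return labels
    if variant == "loo":
        return list(range(n_samples))
    raise ValueError(f"unknown variant: {variant!r}")


if __name__ == "__main__":
    for name in ("contiguous", "venetian", "random", "loo"):
        print(name, fold_labels(10, 5, name))
```

If the samples are sorted by the response value, contiguous blocks leave out a whole range of the response at once, whereas the Venetian blind pattern spreads each fold across the full range. This may be one reason the two variants behave differently, although the record itself does not state the ordering of the data.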
Pages: 661-674
Number of pages: 14