Exploring the impact of size of training sets for the development of predictive QSAR models

Cited by: 280
Authors
Roy, Partha Pratim [1 ]
Leonard, J. Thomas [2 ]
Roy, Kunal [1 ]
Affiliations
[1] Jadavpur Univ, Dept Pharmaceut Technol, Div Med Chem & Pharmaceut, Drug Theoret & Cheminformat Lab, Kolkata 700032, India
[2] KM Coll Pharm, Dept Pharmaceut Chem, Madurai 625107, Tamil Nadu, India
Keywords
QSAR; validation; training set size; K-means clusters; stepwise regression; FA-MLR; PLS
DOI
10.1016/j.chemolab.2007.07.004
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
When building a predictive quantitative structure-activity relationship (QSAR) model, validating the developed model is a very important task. However, because a truly new data set is often unavailable for checking the predictability and robustness of the developed model, external validation in QSAR studies is commonly performed by splitting the available data into training and test sets. In the present work we have attempted to explore the impact of training set size on the quality of prediction using different topological descriptors and three different statistical techniques. Three data sets of moderate size were used: cytoprotection data of anti-HIV thiocarbamates (n=62), HIV reverse transcriptase inhibition data of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) derivatives (n=107) and bioconcentration factor data of diverse functional compounds (n=122). In each case, the data set was divided into different combinations of training and test sets, maintaining different size ratios over several iterations. For the first two data sets, reducing the training set size had a significant impact on the predictive ability of the models, with the first data set showing a higher dependence on size than the second. However, for the bioconcentration factor models, no significant impact of training set size on the quality of prediction could be found. Hence, no general rule can be formulated regarding the impact of training set size on the quality of prediction. The optimum size of the training set should be set based on the particular data set, the types of descriptors and the statistical analysis being used. (c) 2007 Elsevier B.V. All rights reserved.
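The splitting procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' protocol. It assumes random splits (the keywords suggest K-means clustering was involved), ordinary least squares in place of stepwise regression, FA-MLR or PLS, and synthetic data in place of the three data sets. The names `q2_external` and `split_size_study` are hypothetical, and the predictive R^2 used here is one commonly used definition, referenced to the training-set mean.

```python
import numpy as np

def q2_external(y_train, y_test, y_pred):
    """Predictive R^2 on a test set, referenced to the training-set mean."""
    press = np.sum((y_test - y_pred) ** 2)
    ss = np.sum((y_test - np.mean(y_train)) ** 2)
    return 1.0 - press / ss

def split_size_study(X, y, train_fractions, n_iter=20, seed=0):
    """Return the mean test-set predictive R^2 for each training fraction."""
    rng = np.random.default_rng(seed)
    n = len(y)
    results = {}
    for frac in train_fractions:
        n_train = int(round(frac * n))
        scores = []
        for _ in range(n_iter):
            # Assumption: a plain random split, repeated n_iter times.
            idx = rng.permutation(n)
            tr, te = idx[:n_train], idx[n_train:]
            # Assumption: ordinary least squares with an intercept term.
            A_tr = np.column_stack([np.ones(len(tr)), X[tr]])
            coef, *_ = np.linalg.lstsq(A_tr, y[tr], rcond=None)
            A_te = np.column_stack([np.ones(len(te)), X[te]])
            scores.append(q2_external(y[tr], y[te], A_te @ coef))
        results[frac] = float(np.mean(scores))
    return results

# Synthetic stand-in for a descriptor matrix and activity vector (not real data).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(107, 5))
true_coef = np.array([1.0, -0.5, 0.3, 0.0, 0.8])
y = X @ true_coef + data_rng.normal(scale=0.5, size=107)

results = split_size_study(X, y, train_fractions=[0.25, 0.5, 0.75])
for frac, score in results.items():
    print(f"training fraction {frac:.2f}: mean predictive R^2 = {score:.3f}")
```

Comparing the printed scores across fractions shows how strongly a given data set and modelling method depend on training set size, which is the quantity the study examines.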
Pages: 31-42
Number of pages: 12
Related papers
50 in total
  • [21] On Various Metrics Used for Validation of Predictive QSAR Models with Applications in Virtual Screening and Focused Library Design
    Roy, Kunal
    Mitra, Indrani
    COMBINATORIAL CHEMISTRY & HIGH THROUGHPUT SCREENING, 2011, 14 (06) : 450 - 474
  • [22] Filter feature selectors in the development of binary QSAR models
    Cerruela Garcia, G.
    Perez-Parras Toledano, J.
    de Haro Garcia, A.
    Garcia-Pedrajas, N.
    SAR AND QSAR IN ENVIRONMENTAL RESEARCH, 2019, 30 (05) : 313 - 345
  • [23] A tool for the calculation of molecular descriptors in the development of QSAR models
    Ruiz, Irene Luque
    Gomez-Nieto, Miguel Angel
    COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2008, PT 1, PROCEEDINGS, 2008, 5072 : 986 - 996
  • [24] Combining QSAR classification models for predictive modeling of human monoamine oxidase inhibitors
    Helguera, Aliuska Morales
    Perez-Garrido, Alfonso
    Gaspar, Alexandra
    Reis, Joana
    Cagide, Fernando
    Vina, Dolores
    Cordeiro, M. Natalia D. S.
    Borges, Fernanda
    EUROPEAN JOURNAL OF MEDICINAL CHEMISTRY, 2013, 59 : 75 - 90
  • [25] Highly predictive hologram QSAR models of nitrile-containing cruzain inhibitors
    Silva, Daniel Gedder
    Rocha, Josmar Rodrigues
    Sartori, Geraldo Rodrigues
    Montanari, Carlos Alberto
    JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS, 2017, 35 (15) : 3232 - 3249
  • [26] Development of improved QSAR models for predicting the outcome of the in vivo micronucleus genetic toxicity assay
    Yoo, Jae Wook
    Kruhlak, Naomi L.
    Landry, Curran
    Cross, Kevin P.
    Sedykh, Alexander
    Stavitskaya, Lidiya
    REGULATORY TOXICOLOGY AND PHARMACOLOGY, 2020, 113
  • [27] Critically Assessing the Predictive Power of QSAR Models for Human Liver Microsomal Stability
    Liu, Ruifeng
    Schyman, Patric
    Wallqvist, Anders
    JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2015, 55 (08) : 1566 - 1575
  • [28] Exploring Dimensionality Reduction Techniques for Deep Learning Driven QSAR Models of Mutagenicity
    Kalian, Alexander D.
    Benfenati, Emilio
    Osborne, Olivia J.
    Gott, David
    Potter, Claire
    Dorne, Jean-Lou C. M.
    Guo, Miao
    Hogstrand, Christer
    TOXICS, 2023, 11 (07)
  • [29] Comparative QSAR Analysis of 3,5-bis (Arylidene)-4-Piperidone Derivatives: the Development of Predictive Cytotoxicity Models
    Edraki, Najmeh
    Das, Umashankar
    Hemateenejad, Bahram
    Dimmock, Jonathan R.
    Miri, Ramin
    IRANIAN JOURNAL OF PHARMACEUTICAL RESEARCH, 2016, 15 (02) : 425 - 437
  • [30] QSAR models for soil ecotoxicity: Development and validation of models to predict reproductive toxicity of organic chemicals in the collembola Folsomia candida
    Lavado, Giovanna J.
    Baderna, Diego
    Carnesecchi, Edoardo
    Toropova, Alla P.
    Toropov, Andrey A.
    Dorne, Jean Lou C. M.
    Benfenati, Emilio
    JOURNAL OF HAZARDOUS MATERIALS, 2022, 423