Exploring the impact of size of training sets for the development of predictive QSAR models

Cited by: 280
Authors
Roy, Partha Pratim [1]
Leonard, J. Thomas [2]
Roy, Kunal [1]
Affiliations
[1] Jadavpur Univ, Dept Pharmaceut Technol, Div Med Chem & Pharmaceut, Drug Theoret & Cheminformat Lab, Kolkata 700032, India
[2] KM Coll Pharm, Dept Pharmaceut Chem, Madurai 625107, Tamil Nadu, India
Keywords
QSAR; validation; training set size; K-means clusters; stepwise regression; FA-MLR; PLS
DOI
10.1016/j.chemolab.2007.07.004
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
When building a predictive quantitative structure-activity relationship (QSAR) model, validating the developed model is a very important task. However, because a truly new set of data is often unavailable for checking the predictability and robustness of the developed model, external validation in QSAR studies is commonly performed by splitting the available data into training and test sets. In the present work we have attempted to explore the impact of training set size on the quality of prediction using different topological descriptors and three different statistical techniques. Three data sets of moderate size have been used for the present study: cytoprotection data of anti-HIV thiocarbamates (n=62), HIV reverse transcriptase inhibition data of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) derivatives (n=107) and bioconcentration factor data of diverse functional compounds (n=122). In each case, the data set was divided into different combinations of training and test sets, maintaining different size ratios over several iterations. For the first two data sets, reducing the training set size had a significant impact on the predictive ability of the models, with the first data set showing a higher dependence on size than the second. However, in the modeling of bioconcentration factor, no significant impact of training set size on the quality of prediction could be found. Hence, no general rule can be formulated regarding the impact of training set size on the quality of prediction. The optimum size of the training set should be set based on the particular data set and the types of descriptors and statistical analysis being used. (c) 2007 Elsevier B.V. All rights reserved.
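The splitting protocol described in the abstract (repeated training/test divisions at several size ratios) can be illustrated with a minimal Python sketch. It is not the authors' procedure: it assumes plain random splits rather than K-means-cluster-based selection, uses ordinary least squares in place of stepwise regression, FA-MLR or PLS, and runs on synthetic data. The names `r2_pred` and `split_study`, the chosen fractions and the iteration count are all hypothetical. The external predictivity measure follows one commonly used definition, R²pred = 1 - Σ(y_test - ŷ_test)² / Σ(y_test - ȳ_train)².

```python
import numpy as np


def r2_pred(y_test, y_hat, y_train_mean):
    """External predictive R^2: 1 - PRESS / SS about the training-set mean."""
    press = np.sum((y_test - y_hat) ** 2)
    ss = np.sum((y_test - y_train_mean) ** 2)
    return 1.0 - press / ss


def split_study(X, y, train_fractions, n_iter=20, seed=0):
    """For each training fraction, repeat random splits and collect R^2_pred.

    Assumption: random splitting and ordinary least squares stand in for
    the cluster-based selection and regression methods named in the record.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    results = {}
    for frac in train_fractions:
        n_train = int(round(frac * n))
        scores = []
        for _ in range(n_iter):
            perm = rng.permutation(n)
            tr, te = perm[:n_train], perm[n_train:]
            # Fit y = b0 + X b on the training compounds only.
            A_tr = np.column_stack([np.ones(len(tr)), X[tr]])
            coef, *_ = np.linalg.lstsq(A_tr, y[tr], rcond=None)
            # Predict the held-out compounds and score them.
            A_te = np.column_stack([np.ones(len(te)), X[te]])
            scores.append(r2_pred(y[te], A_te @ coef, y[tr].mean()))
        results[frac] = (float(np.mean(scores)), float(np.std(scores)))
    return results


# Synthetic stand-in for a descriptor matrix (62 compounds, 3 descriptors);
# these are NOT the thiocarbamate, HEPT or bioconcentration data.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(62, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + data_rng.normal(scale=0.3, size=62)

summary = split_study(X, y, train_fractions=(0.25, 0.5, 0.75))
for frac, (mean, sd) in summary.items():
    print(f"train fraction {frac:.2f}: mean R2_pred = {mean:.3f}, sd = {sd:.3f}")
```

Comparing the mean and spread of R²pred across fractions is one simple way to see whether a given data set is sensitive to training set size; whether it is depends on the data, descriptors and method, as the abstract concludes.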
Pages: 31-42
Number of pages: 12
Related papers
50 in total
  • [31] Structural Similarity and Descriptor Spaces for Clustering and Development of QSAR Models
    Luque Ruiz, Irene
    Cerruela Garcia, Gonzalo
    Angel Gomez-Nieto, Miguel
    CURRENT COMPUTER-AIDED DRUG DESIGN, 2013, 9 (02) : 254 - 271
  • [32] QSAR models for inhibitors of physiological impact of Escherichia coli that leads to diarrhea
    Toropov, Andrey A.
    Toropova, Alla P.
    Benfenati, Emilio
    Gini, Giuseppina
    Leszczynska, Danuta
    Leszczynski, Jerzy
    De Nucci, Gilberto
    BIOCHEMICAL AND BIOPHYSICAL RESEARCH COMMUNICATIONS, 2013, 432 (02) : 214 - 225
  • [33] Predictive QSAR models development and validation for human ether-a-go-go related gene (hERG) blockers using newer tools
    Hari Narayana Moorthy, N. S.
    Ramos, Maria J.
    Fernandes, Pedro A.
    JOURNAL OF ENZYME INHIBITION AND MEDICINAL CHEMISTRY, 2014, 29 (03) : 317 - 324
  • [34] Comparison of MC4PC and MDL-QSAR rodent carcinogenicity predictions and the enhancement of predictive performance by combining QSAR models
    Contrera, Joseph F.
    Kruhlak, Naomi L.
    Matthews, Edwin J.
    Benz, R. Daniel
    REGULATORY TOXICOLOGY AND PHARMACOLOGY, 2007, 49 (03) : 172 - 182
  • [35] Development of QSAR models for predicting anti-HIV-1 activity using the Monte Carlo method
    Toropov, Andrey A.
    Toropova, Alla P.
    Raska, Ivan, Jr.
    Benfenati, Emilio
    Gini, Giuseppina
    CENTRAL EUROPEAN JOURNAL OF CHEMISTRY, 2013, 11 (03) : 371 - 380
  • [36] Exploring the reactivity of high-valent copper species with emerging contaminants using predictive QSAR modelling
    Zhao, Tao
    Xu, Minghao
    Yang, Xuerui
    Chovelon, Jean-Marc
    Zhou, Lei
    ENVIRONMENTAL TECHNOLOGY, 2025,
  • [37] Development of QSAR models to predict blood-brain barrier permeability
    Faramarzi, Sadegh
    Kim, Marlene T.
    Volpe, Donna A.
    Cross, Kevin P.
    Chakravarti, Suman
    Stavitskaya, Lidiya
    FRONTIERS IN PHARMACOLOGY, 2022, 13
  • [38] Development of predictive models for nutritional assessment in the elderly
    Munoz Diaz, Belen
    Martinez De La Iglesia, Jorge
    Romero-Saldana, Manuel
    Molina-Luque, Rafael
    Arenas de Larriva, Antonio P.
    Molina-Recio, Guillermo
    PUBLIC HEALTH NUTRITION, 2021, 24 (03) : 449 - 456
  • [39] Exploring the QSAR's predictive truthfulness of the novel N-tuple discrete derivative indices on benchmark datasets
    Martinez-Santiago, O.
    Marrero-Ponce, Y.
    Vivas-Reyes, R.
    Rivera-Borroto, O. M.
    Hurtado, E.
    Treto-Suarez, M. A.
    Ramos, Y.
    Vergara-Murillo, F.
    Orozco-Ugarriza, M. E.
    Martinez-Lopez, Y.
    SAR AND QSAR IN ENVIRONMENTAL RESEARCH, 2017, 28 (05) : 367 - 389
  • [40] Support vector machines: Development of QSAR models for predicting anti-HIV-1 activity of TIBO derivatives
    Darnag, Rachid
    Mostapha Mazouz, E. L.
    Schmitzer, Andreea
    Villemin, Didier
    Jarid, Abdellah
    Cherqaoui, Driss
    EUROPEAN JOURNAL OF MEDICINAL CHEMISTRY, 2010, 45 (04) : 1590 - 1597