Exploring the impact of size of training sets for the development of predictive QSAR models

Cited by: 280
Authors
Roy, Partha Pratim [1]
Leonard, J. Thomas [2]
Roy, Kunal [1]
Affiliations
[1] Jadavpur Univ, Dept Pharmaceut Technol, Div Med Chem & Pharmaceut, Drug Theoret & Cheminformat Lab, Kolkata 700032, India
[2] KM Coll Pharm, Dept Pharmaceut Chem, Madurai 625107, Tamil Nadu, India
Keywords
QSAR; validation; training set size; K-means clusters; stepwise regression; FA-MLR; PLS;
DOI
10.1016/j.chemolab.2007.07.004
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
When building a predictive quantitative structure-activity relationship (QSAR) model, validating the developed model is a very important task. However, because a truly new data set is often unavailable for checking the predictability and robustness of the model, external validation in QSAR studies is commonly performed by splitting the available data into training and test sets. In the present work we have attempted to explore the impact of training set size on the quality of prediction using different topological descriptors and three different statistical techniques. Three data sets of moderate size have been used for the present study: cytoprotection data of anti-HIV thiocarbamates (n=62), HIV reverse transcriptase inhibition data of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) derivatives (n=107) and bioconcentration factor data of diverse functional compounds (n=122). In each case, the data set was divided into different combinations of training and test sets, maintaining different size ratios over several iterations. For the first two data sets, reducing the training set size had a significant impact on the predictive ability of the models, with the first data set showing a higher dependence on size than the second. However, in the modeling of bioconcentration factor, no significant impact of training set size on the quality of prediction could be found. Hence, no general rule can be formulated regarding the impact of training set size on the quality of prediction. The optimum size of the training set should be chosen based on the particular data set, the types of descriptors and the statistical analysis being used. © 2007 Elsevier B.V. All rights reserved.
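The splitting procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' protocol: it assumes purely random splits (the keywords indicate the paper uses K-means clusters), a single hypothetical descriptor fitted by ordinary least squares instead of stepwise regression, FA-MLR or PLS, and synthetic data sized like the smallest data set (n=62). The external predictive R^2 is computed against the training-set mean response, which is one common convention and is assumed here. The names `fit_line`, `r2_pred` and `size_study` are illustrative only.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for one descriptor: returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r2_pred(train, test):
    """External predictive R^2, referenced to the training-set mean response."""
    xs, ys = zip(*train)
    slope, intercept = fit_line(xs, ys)
    y_train_mean = statistics.fmean(ys)
    press = sum((y - (slope * x + intercept)) ** 2 for x, y in test)
    ss = sum((y - y_train_mean) ** 2 for _, y in test)
    return 1.0 - press / ss


def size_study(data, train_fractions, n_iter=50, seed=0):
    """Mean and standard deviation of R^2_pred over repeated random splits."""
    rng = random.Random(seed)
    summary = {}
    for frac in train_fractions:
        n_train = round(frac * len(data))
        scores = []
        for _ in range(n_iter):
            shuffled = rng.sample(data, len(data))  # shuffled copy of the data
            scores.append(r2_pred(shuffled[:n_train], shuffled[n_train:]))
        summary[frac] = (statistics.fmean(scores), statistics.stdev(scores))
    return summary


# Synthetic data only (assumption): 62 "compounds", one descriptor,
# and a noisy linear response. No values here come from the paper.
rng = random.Random(42)
data = []
for _ in range(62):
    x = rng.uniform(0.0, 5.0)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.5)))

summary = size_study(data, train_fractions=[0.25, 0.5, 0.75])
for frac, (mean_r2, sd_r2) in summary.items():
    print(f"train fraction {frac:.2f}: mean R2_pred = {mean_r2:.3f}, sd = {sd_r2:.3f}")
```

Comparing the mean and spread of R^2_pred across training fractions is one simple way to see whether a given data set is sensitive to training set size, which is the question the abstract says must be answered case by case.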
Pages: 31-42
Number of pages: 12