Exploring the impact of size of training sets for the development of predictive QSAR models

Cited by: 280
Authors
Roy, Partha Pratim [1]
Leonard, J. Thomas [2]
Roy, Kunal [1]
Affiliations
[1] Jadavpur Univ, Dept Pharmaceut Technol, Div Med Chem & Pharmaceut, Drug Theoret & Cheminformat Lab, Kolkata 700032, India
[2] KM Coll Pharm, Dept Pharmaceut Chem, Madurai 625107, Tamil Nadu, India
Keywords
QSAR; validation; training set size; K-means clusters; stepwise regression; FA-MLR; PLS
DOI
10.1016/j.chemolab.2007.07.004
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Validation of the developed model is a very important task when building a predictive quantitative structure-activity relationship (QSAR). However, because a truly new set of data is often unavailable for checking the predictability and robustness of the developed model, external validation in QSAR studies is commonly performed by splitting the available data into training and test sets. In the present work, we explore the impact of training set size on the quality of prediction using different topological descriptors and three different statistical techniques. Three data sets of moderate size were used: cytoprotection data of anti-HIV thiocarbamates (n=62), HIV reverse transcriptase inhibition data of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) derivatives (n=107), and bioconcentration factor data of diverse functional compounds (n=122). In each case, the data set was divided into different combinations of training and test sets, maintaining different size ratios over several iterations. For the first two data sets, reducing the training set size had a significant impact on the predictive ability of the models, with the first data set showing a stronger dependence on size than the second. For the bioconcentration factor data, however, no significant impact of training set size on the quality of prediction was found. Hence, no general rule can be formulated regarding the impact of training set size on the quality of prediction. The optimum training set size should be chosen based on the particular data set, the types of descriptors, and the statistical analysis being used. © 2007 Elsevier B.V. All rights reserved.
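The splitting protocol described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' procedure: the data are synthetic, the splits are purely random (the keywords suggest the study used K-means clusters for selection), ordinary least squares stands in for stepwise regression, FA-MLR, and PLS, and the hypothetical helpers `r2_pred` and `size_study` use one commonly used form of the external predictive R², computed against the training-set mean.

```python
import numpy as np


def r2_pred(y_test, y_hat, y_train_mean):
    """External predictive R^2: 1 - PRESS / SS about the training-set mean."""
    press = np.sum((y_test - y_hat) ** 2)
    ss = np.sum((y_test - y_train_mean) ** 2)
    return 1.0 - press / ss


def size_study(X, y, train_fractions, n_iter=20, seed=0):
    """Refit a linear model on random splits at each training fraction.

    Returns {fraction: (mean R^2_pred, std R^2_pred)} over n_iter splits.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    results = {}
    for frac in train_fractions:
        n_train = int(round(frac * n))
        scores = []
        for _ in range(n_iter):
            perm = rng.permutation(n)
            tr, te = perm[:n_train], perm[n_train:]
            # Ordinary least squares with an intercept column (assumed model).
            A_train = np.column_stack([np.ones(len(tr)), X[tr]])
            coef, *_ = np.linalg.lstsq(A_train, y[tr], rcond=None)
            A_test = np.column_stack([np.ones(len(te)), X[te]])
            y_hat = A_test @ coef
            scores.append(r2_pred(y[te], y_hat, y[tr].mean()))
        results[frac] = (float(np.mean(scores)), float(np.std(scores)))
    return results


# Synthetic stand-in data: 107 "compounds", 3 "descriptors" (illustrative only).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(107, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + data_rng.normal(scale=0.3, size=107)

results = size_study(X, y, train_fractions=[0.25, 0.5, 0.75])
for frac, (mean_r2, std_r2) in results.items():
    print(f"train fraction {frac:.2f}: R2_pred = {mean_r2:.3f} +/- {std_r2:.3f}")
```

Comparing the mean and spread of the score across training fractions is one way to see whether a given data set is sensitive to training set size. On real QSAR data the outcome would depend on the data set, the descriptors, and the modeling technique, as the abstract concludes.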
Pages: 31-42
Page count: 12