Effect of Data Quality and Data Quantity on the Estimation of Intrinsic Solubility: Analysis Based on a Single-Source Data Set

Cited by: 0
Authors
Zhao, Jiaxi [1]
Hermans, Eline [2]
Sepassi, Kia [3]
Tistaert, Christophe [2]
Bergstrom, Christel A. S. [1]
Ahmad, Mazen [4]
Larsson, Per [1]
Affiliations
[1] Uppsala Univ, Dept Pharm, S-75123 Uppsala, Sweden
[2] Janssen Pharmaceut NV, Pharmaceut & Mat Sci, B-2340 Beerse, Belgium
[3] Janssen Res & Dev LLC, Discovery Pharmaceut, La Jolla, CA 92121 USA
[4] Janssen Pharmaceut NV, In Silico Discovery, B-2340 Beerse, Belgium
Keywords
solubility prediction; machine learning; quantitative structure-property relationship (QSPR); intrinsic solubility; data quality; AQUEOUS SOLUBILITY; DRUG SOLUBILITY; PREDICTION
DOI
10.1021/acs.molpharmaceut.4c00685
Chinese Library Classification
R-3 [Medical research methods]; R3 [Basic medicine]
Discipline classification code
1001
Abstract
Aqueous solubility is one of the most important physicochemical properties of drug molecules and a major driving force for oral drug absorption. To date, in silico models have shown limited performance when estimating solubility in novel chemical space. To investigate possible reasons and remedies, the Johnson and Johnson in-house aqueous solubility data set, covering over 40,000 compounds, was leveraged. All data were generated through the same high-throughput assay, providing a unique opportunity to explore the relationship between data quality, data quantity, and model estimations. Six intrinsic solubility data sets with different sizes and noise levels were generated using three different approaches: (i) inclusion or exclusion of amorphous solid residue, (ii) measured or experimental log D to identify the intrinsic solubility, and (iii) adopting or omitting a quality check process in the data processing workflow. A random forest regressor was trained on the data sets with three different sets of descriptors calculated from RDKit, ADMET Predictor, or Mordred, and performance was evaluated with nested cross-validation as well as ten refined test sets. The models confirm, as expected, that for the same data set size, high-quality data lead to better model performance. However, models trained on larger data sets containing analytical variability can give estimations as accurate as those from models trained on small, clean, and diverse data sets. In contrast, the noise introduced by including measurements in which amorphous solid was present after the solubility measurement cannot be overcome by increasing data size, because such measurements introduce a systematic positive error into the data set; this confirms the importance of critical data review. Finally, two top-performing models were tested on the first test set from the second solubility challenge, achieving RMSE values of 0.74 and 0.72, with 46% and 48% of predictions within log S ± 0.5, respectively. These results demonstrated improved performance compared to those reported in the findings of the competition, highlighting that a single-source curated data set can enhance the prediction of intrinsic solubility.
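The abstract reports two evaluation figures: RMSE and the share of predictions within log S ± 0.5. A minimal sketch of how such figures could be computed is shown here. The function name `solubility_metrics`, the inclusive tolerance boundary, the added mean signed error, and the example values are all assumptions for illustration; none of them come from the paper.

```python
import math


def solubility_metrics(y_true, y_pred, tol=0.5):
    """Return (RMSE, percent within +/- tol log units, mean signed error).

    Hypothetical helper: the paper's exact definitions are not given in the
    abstract, so the boundary |error| <= tol is treated as inclusive here.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    # Signed error: positive means the prediction overestimates log S.
    errors = [p - t for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    pct_within = 100.0 * sum(abs(e) <= tol for e in errors) / len(errors)
    bias = sum(errors) / len(errors)
    return rmse, pct_within, bias


# Made-up log S values, not data from the study.
measured = [-3.0, -4.5, -2.0, -5.0]
predicted = [-3.5, -4.0, -1.0, -5.0]
rmse, pct, bias = solubility_metrics(measured, predicted)
print(f"RMSE={rmse:.2f}, within ±0.5: {pct:.0f}%, mean signed error={bias:+.2f}")
```

The mean signed error is included because the abstract distinguishes random analytical variability from a systematic positive error; a value far from zero would point to the latter.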
Pages: 5261-5271
Number of pages: 11