Effect of Data Quality and Data Quantity on the Estimation of Intrinsic Solubility: Analysis Based on a Single-Source Data Set

Authors
Zhao, Jiaxi [1]
Hermans, Eline [2]
Sepassi, Kia [3]
Tistaert, Christophe [2]
Bergstrom, Christel A. S. [1]
Ahmad, Mazen [4]
Larsson, Per [1]
Affiliations
[1] Uppsala Univ, Dept Pharm, S-75123 Uppsala, Sweden
[2] Janssen Pharmaceut NV, Pharmaceut & Mat Sci, B-2340 Beerse, Belgium
[3] Janssen Res & Dev LLC, Discovery Pharmaceut, La Jolla, CA 92121 USA
[4] Janssen Pharmaceut NV, In Silico Discovery, B-2340 Beerse, Belgium
Keywords
solubility prediction; machine learning; quantitative structure-property relationship (QSPR); intrinsic solubility; data quality; aqueous solubility; drug solubility; prediction
DOI
10.1021/acs.molpharmaceut.4c00685
Chinese Library Classification (CLC) number
R-3 [Medical research methods]; R3 [Basic medicine]
Discipline classification code
1001
Abstract
Aqueous solubility is one of the most important physicochemical properties of drug molecules and a major driving force for oral drug absorption. To date, the performance of in silico models in estimating solubility for novel chemical space remains limited. To investigate possible reasons and remedies, the Johnson and Johnson in-house aqueous solubility data set, covering more than 40,000 compounds, was used. All data were generated with the same high-throughput assay, providing a unique opportunity to explore the relationship between data quality, data quantity, and model estimations. Six intrinsic solubility data sets of different sizes and noise levels were generated by varying three choices in the data processing workflow: (i) including or excluding measurements with amorphous solid residue, (ii) using calculated or experimental log D to identify the intrinsic solubility, and (iii) adopting or omitting a quality check. A random forest regressor was trained on each data set with three different descriptor sets, calculated with RDKit, ADMET Predictor, or Mordred, and performance was evaluated with nested cross-validation and with ten refined test sets. The models confirm, as expected, that for the same data set size, higher-quality data leads to better model performance. They also show that models trained on larger data sets containing analytical variability can give estimations as accurate as those of models trained on small, clean, and diverse data sets. However, the noise introduced by including measurements with amorphous solid present after the solubility measurement cannot be overcome by increasing the data set size, because it adds a systematic positive bias to the data; this confirms the importance of critical data review. Finally, two top-performing models were tested on the first test set from the second solubility challenge, achieving RMSE values of 0.74 and 0.72 and fractions of compounds predicted within ±0.5 log units (log S ± 0.5) of 46% and 48%, respectively. These results improve on those reported in the findings of the competition, highlighting that a single-source curated data set can enhance the prediction of intrinsic solubility.
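The two evaluation metrics quoted in the abstract, RMSE and the percentage of compounds predicted within ±0.5 log units, can be illustrated with a minimal sketch. This is not the authors' code: the function names `rmse` and `pct_within`, the inclusive tolerance check, and the example log S values are all assumptions made for illustration.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted log S values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pct_within(observed, predicted, tol=0.5):
    """Percentage of predictions within +/- tol log units of the observation.

    Assumption: the boundary is inclusive (|error| <= tol).
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for o, p in zip(observed, predicted) if abs(o - p) <= tol)
    return 100.0 * hits / len(observed)


# Hypothetical log S values, made up for illustration only.
observed = [-3.0, -4.5, -2.0, -5.0]
predicted = [-3.2, -3.5, -2.4, -5.0]

print(f"RMSE: {rmse(observed, predicted):.2f}")
print(f"% within +/- 0.5 log units: {pct_within(observed, predicted):.0f}%")
```

The two numbers respond differently to errors: a single large miss inflates RMSE, while the ±0.5 percentage only counts how many predictions fall inside the tolerance band, which may be why both are reported together.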
Pages: 5261-5271
Number of pages: 11