Effect of Data Quality and Data Quantity on the Estimation of Intrinsic Solubility: Analysis Based on a Single-Source Data Set

Cited by: 0
Authors
Zhao, Jiaxi [1]
Hermans, Eline [2]
Sepassi, Kia [3]
Tistaert, Christophe [2]
Bergstrom, Christel A. S. [1]
Ahmad, Mazen [4]
Larsson, Per [1]
Affiliations
[1] Uppsala Univ, Dept Pharm, S-75123 Uppsala, Sweden
[2] Janssen Pharmaceut NV, Pharmaceut & Mat Sci, B-2340 Beerse, Belgium
[3] Janssen Res & Dev LLC, Discovery Pharmaceut, La Jolla, CA 92121 USA
[4] Janssen Pharmaceut NV, In Silico Discovery, B-2340 Beerse, Belgium
Keywords
solubility prediction; machine learning; quantitative structure-property relationship (QSPR); intrinsic solubility; data quality; AQUEOUS SOLUBILITY; DRUG SOLUBILITY; PREDICTION
DOI
10.1021/acs.molpharmaceut.4c00685
Chinese Library Classification number
R-3 [Medical research methods]; R3 [Basic medicine]
Discipline classification code
1001
Abstract
Aqueous solubility is one of the most important physicochemical properties of drug molecules and a major driving force for oral drug absorption. To date, in silico models have shown limited performance when estimating solubility in novel chemical space. To investigate possible reasons and remedies, the Johnson and Johnson in-house aqueous solubility data set of over 40,000 compounds was used. All data were generated with the same high-throughput assay, providing a unique opportunity to explore the relationship between data quality, data quantity, and model estimations. Six intrinsic solubility data sets with different sizes and noise levels were generated using three approaches: (i) including or excluding measurements with amorphous solid residue, (ii) using measured or calculated log D to identify the intrinsic solubility, and (iii) adopting or omitting a quality check in the data processing workflow. A random forest regressor was trained on each data set with three different sets of descriptors, calculated with RDKit, ADMET Predictor, or Mordred, and performance was evaluated with nested cross-validation as well as ten refined test sets. The models confirm, as expected, that for the same data set size, higher-quality data lead to better model performance. However, models trained on larger data sets containing analytical variability can give estimations as accurate as those from models trained on small, clean, and diverse data sets. In contrast, noise introduced by including measurements in which amorphous solid was present after the solubility measurement cannot be overcome by increasing the data set size, because such data introduce a systematic positive error, which confirms the importance of critical data review. Finally, two top-performing models were tested on the first test set from the second solubility challenge, achieving RMSE values of 0.74 and 0.72 and 46% and 48% of estimations within ±0.5 log S units, respectively. These results are an improvement over those reported in the findings of the competition, highlighting that a single-source curated data set can enhance the prediction of intrinsic solubility.
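The two figures of merit quoted in the abstract, RMSE and the percentage of estimations within ±0.5 log S units, can be illustrated with a minimal sketch. The function names and the example values below are hypothetical and chosen only for illustration; they are not taken from the paper or its data.

```python
import math


def rmse(measured, predicted):
    """Root-mean-square error between measured and predicted log S values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pct_within(measured, predicted, tol=0.5):
    """Percentage of predictions whose absolute error is at most `tol` log units."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for m, p in zip(measured, predicted) if abs(m - p) <= tol)
    return 100.0 * hits / len(measured)


# Made-up log S values, for illustration only (not data from the study).
measured_log_s = [-3.0, -4.5, -2.0, -5.0]
predicted_log_s = [-3.5, -4.5, -3.0, -4.0]

example_rmse = rmse(measured_log_s, predicted_log_s)
example_pct = pct_within(measured_log_s, predicted_log_s)
print(f"RMSE = {example_rmse:.2f}, within ±0.5 log S = {example_pct:.0f}%")
```

Reporting both numbers together is useful because RMSE is sensitive to a few large errors, while the ±0.5 criterion counts how many compounds are estimated acceptably regardless of how far off the rest are.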
Pages: 5261-5271
Page count: 11