Effect of Data Quality and Data Quantity on the Estimation of Intrinsic Solubility: Analysis Based on a Single-Source Data Set

Cited by: 0
Authors
Zhao, Jiaxi [1]
Hermans, Eline [2]
Sepassi, Kia [3]
Tistaert, Christophe [2]
Bergstrom, Christel A. S. [1]
Ahmad, Mazen [4]
Larsson, Per [1]
Affiliations
[1] Uppsala Univ, Dept Pharm, S-75123 Uppsala, Sweden
[2] Janssen Pharmaceut NV, Pharmaceut & Mat Sci, B-2340 Beerse, Belgium
[3] Janssen Res & Dev LLC, Discovery Pharmaceut, La Jolla, CA 92121 USA
[4] Janssen Pharmaceut NV, In Silico Discovery, B-2340 Beerse, Belgium
Keywords
solubility prediction; machine learning; quantitative structure-property relationship (QSPR); intrinsic solubility; data quality; AQUEOUS SOLUBILITY; DRUG SOLUBILITY; PREDICTION
DOI
10.1021/acs.molpharmaceut.4c00685
Chinese Library Classification (CLC) number
R-3 [Medical research methods]; R3 [Basic medicine]
Discipline classification code
1001
Abstract
Aqueous solubility is one of the most important physicochemical properties of drug molecules and a major driving force for oral drug absorption. To date, the performance of in silico models for estimating solubility in novel chemical space remains limited. To investigate possible reasons and remedies for this, the Johnson and Johnson in-house aqueous solubility data, covering over 40,000 compounds, were leveraged. All data were generated through the same high-throughput assay, providing a unique opportunity to explore the relationship between data quality, data quantity, and model estimations. Six intrinsic solubility data sets with different sizes and noise levels were generated using three different approaches: (i) inclusion or exclusion of amorphous solid residue, (ii) measured or experimental log D to identify the intrinsic solubility, and (iii) adopting or omitting a quality check process in the data processing workflow. A random forest regressor was trained on the data sets with three different sets of descriptors calculated with RDKit, ADMET Predictor, or Mordred, and the performances were evaluated with nested cross-validation as well as ten refined test sets. The models confirm, as expected, that for the same data set size, high-quality data lead to better model performance. In addition, models trained on larger data sets containing analytical variability can give estimations as accurate as those from models trained on small, clean, and diverse data sets. However, the noise introduced by including measurements in which amorphous solid was present after the solubility measurement cannot be overcome by increasing the data set size, because it introduces a systematic positive error into the data set; this confirms the importance of critical data review. Finally, two top-performing models were tested on the first test set from the second solubility challenge, achieving RMSE values of 0.74 and 0.72 and percentages within log S ± 0.5 of 46 and 48%, respectively. These results are an improvement over those reported in the findings of the competition, highlighting that a single-source curated data set can enhance the prediction of intrinsic solubility.
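The two figures of merit quoted in the abstract, RMSE and the percentage of predictions within log S ± 0.5, can be illustrated with a minimal sketch. The function names and the example values below are hypothetical and are not taken from the paper; counting a prediction that lies exactly 0.5 log units from the observation as a hit is also an assumption.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted log S values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def percent_within(observed, predicted, tolerance=0.5):
    """Percentage of predictions within +/- tolerance log units of the observation.

    Assumption: the boundary is inclusive (|error| <= tolerance counts as a hit).
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for o, p in zip(observed, predicted) if abs(o - p) <= tolerance)
    return 100.0 * hits / len(observed)


# Made-up log S values, for illustration only (not data from the study).
observed_log_s = [-3.0, -4.5, -2.0, -5.0]
predicted_log_s = [-3.2, -4.9, -3.0, -5.1]

print(f"RMSE: {rmse(observed_log_s, predicted_log_s):.2f}")
print(f"Within log S +/- 0.5: {percent_within(observed_log_s, predicted_log_s):.0f}%")
```

Reporting both values side by side is useful because RMSE is sensitive to a few large errors, whereas the ± 0.5 percentage only counts how many compounds fall inside the tolerance band.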
Pages: 5261-5271
Number of pages: 11