Hybrid Class Balancing Approach for Chemical Compound Toxicity Prediction

Cited by: 0
Authors
Santiago-Gonzalez, Felipe [1]
Martinez-Rodriguez, Jose L. [2]
Garcia-Perez, Carlos [3]
Juarez-Saldivar, Alfredo [4]
Camacho-Cruz, Hugo E. [5]
Affiliations
[1] Autonomous Univ Tamaulipas, Multidisciplinary Acad Unit Reynosa Rodhe, Ciudad Victoria, Mexico
[2] Autonomous Univ Tamaulipas, Fac Engn & Sci, Ciudad Victoria, Mexico
[3] Helmholtz Zentrum Munchen, Digitalizat & Transformat Dept, Ingolstadter Landstr 1, Munich, Germany
[4] Autonomous Univ Tamaulipas, Multidisciplinary Acad Unit Reynosa Aztlan, Ciudad Victoria, Mexico
[5] FMM Autonomous Univ Tamaulipas, Sendero Nacl KM 3H, Matamoros, Mexico
Keywords
Class balancing; oversampling; toxicity prediction; Tox21; classification; undersampling; SMOTE; FEATURES;
DOI
10.2174/0115734099315538240909101737
Chinese Library Classification
R914 [Medicinal Chemistry];
Discipline classification code
100701;
Abstract
Introduction: Computational methods are crucial for efficient and cost-effective drug toxicity prediction. Unfortunately, the data used for prediction is often imbalanced, resulting in biased models that favor the majority class. This paper proposes a hybrid class balancing approach and evaluates its performance on computational models for toxicity prediction on Tox21 datasets.

Methods: The process begins by converting chemical compound structures (SMILES strings) from various bioassay datasets into molecular descriptors that can be processed by algorithms. Subsequently, Undersampling and Oversampling techniques are applied to the training data in two different schemes. In the first scheme (Individual), only one balancing technique (Oversampling or Undersampling) is used. In the second scheme (Hybrid), the training data is divided according to a ratio (e.g., 90-10), and a different balancing technique is applied to each portion. We considered eight resampling techniques (four Oversampling and four Undersampling), six molecular descriptors (based on MACCS, ECFP, and Mordred), and five classification models (KNN, MLP, RF, XGB, and SVM) over 10 bioassay datasets to determine the configurations that yield the best performance.

Results: We defined three testing scenarios: without balancing techniques (baseline), Individual, and Hybrid. We found that using the ENN technique in the MACCS-MLP combination resulted in a 10.01% improvement in performance. The increase for ECFP6-2048 was 16.47% after incorporating a combination of the SMOTE (10%) and RUS (90%) techniques. Meanwhile, using the same combination of techniques, MORDRED-XGB showed the largest increase in performance, achieving a 22.62% improvement.

Conclusion: Integrating any of the class balancing schemes resulted in a minimum of 10.01% improvement in prediction performance compared to the best baseline configuration. In this study, Undersampling techniques were more appropriate due to the significant overlap among samples. Eliminating specific samples from the majority class that lie close to the minority class greatly reduces this overlap.
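
The Hybrid scheme described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the training data is partitioned, so a per-class (stratified) split is assumed here, and plain random oversampling stands in for SMOTE to keep the example dependency-free. The function names (`group_by_label`, `random_undersample`, `random_oversample`, `hybrid_balance`) and the `under_fraction` parameter are hypothetical.

```python
import random
from collections import Counter


def group_by_label(rows):
    """Group (features, label) pairs into a dict keyed by label."""
    groups = {}
    for features, label in rows:
        groups.setdefault(label, []).append((features, label))
    return groups


def random_undersample(rows, rng):
    """RUS: randomly keep as many rows per class as the smallest class has."""
    groups = group_by_label(rows)
    n_min = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, n_min))
    return balanced


def random_oversample(rows, rng):
    """Duplicate rows at random until every class matches the largest class.

    Assumption: this replaces SMOTE, which would synthesize new minority
    samples by interpolating between nearest neighbours.
    """
    groups = group_by_label(rows)
    n_max = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=n_max - len(group)))
    return balanced


def hybrid_balance(rows, under_fraction=0.9, seed=0):
    """Split each class by `under_fraction`, then undersample the first
    portion and oversample the remainder (assumed stratified split)."""
    rng = random.Random(seed)
    under_part, over_part = [], []
    for group in group_by_label(rows).values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * under_fraction)
        under_part.extend(shuffled[:cut])
        over_part.extend(shuffled[cut:])
    return random_undersample(under_part, rng) + random_oversample(over_part, rng)


# Toy imbalanced training set: 900 majority (0) and 100 minority (1) samples.
toy_rows = [([i], 0) for i in range(900)] + [([i], 1) for i in range(100)]
toy_balanced = hybrid_balance(toy_rows, under_fraction=0.9, seed=0)
toy_counts = Counter(label for _, label in toy_balanced)
print(toy_counts)
```

With the 90-10 ratio on this toy set, the undersampled portion contributes 90 samples per class and the oversampled portion another 90 per class, so the balanced training set holds 180 samples of each class. Only the training data is resampled; a held-out test set would keep its original class distribution.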
Pages: 15