Prediction Is a Balancing Act: Importance of Sampling Methods to Balance Sensitivity and Specificity of Predictive Models Based on Imbalanced Chemical Data Sets

Cited by: 124
Authors
Banerjee, Priyanka [1]
Dehnbostel, Frederic O. [1]
Preissner, Robert [1]
Affiliations
[1] Charite Univ Med Berlin, Inst Physiol, Struct Bioinformat Grp, Berlin, Germany
Source
FRONTIERS IN CHEMISTRY | 2018 / Vol. 6
Keywords
machine learning; DILI; sampling methods; Tox21; imbalanced data; molecular fingerprints; sensitivity-specificity balance; SMOTE; INDUCED LIVER-INJURY; CLASSIFICATION;
DOI
10.3389/fchem.2018.00362
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
The increase in the number of new chemicals synthesized in past decades has driven constant growth in the development and application of computational models for predicting the activity and safety profiles of chemicals. Most of the time, such computational models and their applications must deal with imbalanced chemical data, and constructing a classifier from an imbalanced data set is a challenge. In this study, we analyzed and validated the importance of different sampling methods, compared with using no sampling, for achieving well-balanced sensitivity and specificity in a machine learning model trained on imbalanced chemical data. Additionally, this study achieved an accuracy of 93.00%, an AUC of 0.94, an F1 measure of 0.90, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and a Random Forest classifier for the prediction of drug-induced liver injury (DILI). Our results suggest that, irrespective of the data set used, sampling methods can have a major influence on reducing the gap between the sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class imbalance problem using binary chemical data sets.
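To make the two ideas in the abstract concrete, the following is a minimal, standard-library-only sketch, not the authors' pipeline. It assumes binary labels with 1 as the minority (active) class and uses made-up continuous toy data rather than molecular fingerprints. The helper names `sensitivity_specificity` and `smote_like_oversample` are hypothetical; the second only imitates the core SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours. In practice a maintained implementation such as `imblearn.over_sampling.SMOTE` from the imbalanced-learn package would normally be preferred.

```python
import random


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = active."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points (illustrative SMOTE-style sketch).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [m for m in minority if m is not base]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class (assumed data, for illustration only).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like_oversample(minority, n_new=6)

# A classifier that mostly predicts the majority class: high specificity,
# low sensitivity, which is the gap that sampling methods aim to reduce.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

In this toy case the skewed predictions give a sensitivity of one third against a specificity of 1.0. Oversampling should be applied to the training split only, so that synthetic points never leak into the data used for evaluation.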
Pages: 11