Adding Stochastic Negative Examples into Machine Learning Improves Molecular Bioactivity Prediction

Cited by: 16
Authors
Caceres, Elena L. [1]
Mew, Nicholas C. [1]
Keiser, Michael J. [1]
Affiliations
[1] Univ Calif San Francisco, Kavli Inst Fundamental Neurosci, Bakar Computat Hlth Sci Inst, Inst Neurodegenerat Dis, Dept Pharmaceut Chem, Dept, San Francisco, CA 94143 USA
Funding
National Science Foundation (USA);
Keywords
NEURAL-NETWORKS; VALIDATION;
DOI
10.1021/acs.jcim.0c00565
Chinese Library Classification number
R914 [Medicinal Chemistry];
Discipline classification code
100701;
Abstract
Multitask deep neural networks learn to predict ligand-target binding by example, yet public pharmacological data sets are sparse, imbalanced, and approximate. We constructed two hold-out benchmarks to approximate temporal and drug-screening test scenarios, whose characteristics differ from a random split of conventional training data sets. We developed a pharmacological data set augmentation procedure, Stochastic Negative Addition (SNA), which randomly assigns untested molecule-target pairs as transient negative examples during training. Under the SNA procedure, drug-screening benchmark performance increases from R² = 0.1926 ± 0.0186 to 0.4269 ± 0.0272 (122%). This gain was accompanied by a modest decrease in the temporal benchmark (13%). SNA increases in drug-screening performance were consistent for classification and regression tasks and outperformed y-randomized controls. Our results highlight where data and feature uncertainty may be problematic and how leveraging uncertainty into training improves predictions of drug-target relationships.
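The abstract describes SNA only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes a molecules-by-targets label matrix in which untested pairs are stored as NaN; the function name `add_stochastic_negatives`, the `negative_value` of 0.0, and the `ratio` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def add_stochastic_negatives(labels, rng, negative_value=0.0, ratio=1.0):
    """Return a copy of `labels` with some untested (NaN) pairs marked negative.

    labels: 2-D float array (molecules x targets), NaN where a pair is untested.
    negative_value: label assigned to sampled pairs (assumed, not from the paper).
    ratio: negatives added per known label (assumed, not from the paper).
    """
    augmented = labels.copy()
    untested = np.flatnonzero(np.isnan(labels))  # flat indices of untested pairs
    n_known = labels.size - untested.size
    n_add = min(untested.size, int(round(ratio * n_known)))
    chosen = rng.choice(untested, size=n_add, replace=False)
    augmented.flat[chosen] = negative_value
    return augmented


# Toy data: 6 molecules x 4 targets with three measured activities.
rng = np.random.default_rng(0)
labels = np.full((6, 4), np.nan)
labels[0, 1] = 7.2
labels[2, 3] = 5.9
labels[4, 0] = 6.4

for epoch in range(3):
    # A fresh draw each epoch keeps the added negatives transient;
    # the original `labels` matrix is never modified.
    epoch_labels = add_stochastic_negatives(labels, rng)
    # ... train on `epoch_labels` here, masking the remaining NaN entries ...
```

Resampling inside the training loop, instead of once up front, is what makes the negatives "transient" in this sketch: a pair treated as negative in one epoch may be untested again in the next.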
Pages: 5957-5970
Number of pages: 14