Large-Scale Pretraining Improves Sample Efficiency of Active Learning-Based Virtual Screening

Cited by: 3
Authors
Cao, Zhonglin [1]
Sciabola, Simone [1]
Wang, Ye [1]
Affiliations
[1] Biogen, Med Chem, Cambridge, MA 02142 USA
Keywords
MOLECULAR DOCKING; INHIBITOR; DISCOVERY; BINDING; GENERATION; DATABASE; ZINC
DOI
10.1021/acs.jcim.3c01938
Chinese Library Classification number
R914 [Medicinal Chemistry]
Discipline classification code
100701
Abstract
Virtual screening of large compound libraries to identify potential hit candidates is one of the earliest steps in drug discovery. As the size of commercially available compound collections grows exponentially to the scale of billions, active learning and Bayesian optimization have recently been proven as effective methods of narrowing down the search space. An essential component of those methods is a surrogate machine learning model that predicts the desired properties of compounds. An accurate model can achieve high sample efficiency by finding hits with only a fraction of the entire library being virtually screened. In this study, we examined the performance of a pretrained transformer-based language model and graph neural network in a Bayesian optimization active learning framework. The best pretrained model identifies 58.97% of the top-50,000 compounds after screening only 0.6% of an ultralarge library containing 99.5 million compounds, improving 8% over the previous state-of-the-art baseline. Through extensive benchmarks, we show that the superior performance of pretrained models persists in both structure-based and ligand-based drug discovery. Pretrained models can serve as a boost to the accuracy and sample efficiency of active learning-based virtual screening.
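The loop described in the abstract (screen a small batch, refit a surrogate model, let the surrogate choose the next batch) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the k-nearest-neighbor surrogate stands in for the pretrained transformer or graph neural network, `oracle` is a hypothetical toy function standing in for a docking score, the greedy acquisition rule is one common choice among several, and the compound "library" is random feature vectors.

```python
import random


def oracle(x):
    # Hypothetical stand-in for an expensive docking score (lower is better).
    return sum((xi - 0.3) ** 2 for xi in x)


def knn_predict(x, labeled, k=5):
    # Toy surrogate: mean score of the k nearest already-screened compounds.
    def dist(item):
        return sum((a - b) ** 2 for a, b in zip(x, item[0]))

    nearest = sorted(labeled, key=dist)[:k]
    return sum(score for _, score in nearest) / len(nearest)


def active_learning(library, n_init=50, batch=50, rounds=5, seed=0):
    # Returns {compound index: oracle score} for every compound screened.
    rng = random.Random(seed)
    screened = {}
    # Round 0: screen a random initial batch.
    for i in rng.sample(range(len(library)), n_init):
        screened[i] = oracle(library[i])
    for _ in range(rounds):
        labeled = [(library[i], s) for i, s in screened.items()]
        pool = [i for i in range(len(library)) if i not in screened]
        # Greedy acquisition (an assumption): take the best predicted scores.
        pool.sort(key=lambda i: knn_predict(library[i], labeled))
        for i in pool[:batch]:
            screened[i] = oracle(library[i])
    return screened


def top_k_recall(library, screened, k=100):
    # Fraction of the true top-k compounds found among those screened.
    ranked = sorted(range(len(library)), key=lambda i: oracle(library[i]))
    return sum(1 for i in ranked[:k] if i in screened) / k


if __name__ == "__main__":
    rng = random.Random(42)
    library = [[rng.random() for _ in range(4)] for _ in range(1000)]
    screened = active_learning(library)
    print("screened:", len(screened), "of", len(library))
    print("top-100 recall:", top_k_recall(library, screened))
```

The recall returned by `top_k_recall` is the same kind of metric as the abstract's "58.97% of the top-50,000 compounds": the share of the best compounds recovered after screening only part of the library. The numbers this toy prints are not comparable to the paper's results.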
Pages: 1882-1891
Page count: 10