Reformulating Reactivity Design for Data-Efficient Machine Learning

Cited by: 3
Authors
Lewis-Atwell, Toby [1 ,2 ]
Beechey, Daniel [2 ]
Simsek, Ozgur [2 ]
Grayson, Matthew N. [1 ]
Affiliations
[1] Univ Bath, Dept Chem, Bath BA2 7AY, England
[2] Univ Bath, Dept Comp Sci, Bath BA2 7AY, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
machine learning; activation barriers; catalyst design; organic synthesis; data efficiency; REACTION BARRIERS; PREDICTION; ACTIVATION; CHEMISTRY
DOI
10.1021/acscatal.3c02513
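Chinese Library Classification (CLC) number
O64 [Physical chemistry (theoretical chemistry), chemical physics]
Discipline classification codes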
070304 ; 081704 ;
Abstract
Machine learning (ML) can deliver rapid and accurate reaction barrier predictions for use in rational reactivity design. However, model training requires large data sets of typically thousands or tens of thousands of barriers that are very expensive to obtain computationally or experimentally. Furthermore, bespoke data sets are required for each region of interest in reaction space as models typically struggle to generalize. We have therefore reformulated the ML barrier prediction problem toward a much more data-efficient process: finding a reaction from a prespecified set with a desired target value. Our reformulation enables the rapid selection of reactions with purpose-specific activation barriers, for example, in the design of reactivity and selectivity in synthesis, catalyst design, toxicology, and covalent drug discovery, requiring just tens of accurately measured barriers. Importantly, our reformulation does not require generalization beyond the domain of the data set at hand, and we show excellent results for the highly toxicologically and synthetically relevant data sets of aza-Michael addition and transition-metal-catalyzed dihydrogen activation, typically requiring less than 20 accurately measured density functional theory (DFT) barriers. Even for incomplete data sets of E2 and SN2 reactions, with high numbers of missing barriers (74% and 56% respectively), our chosen ML search method still requires significantly fewer data points than the hundreds or thousands needed for more conventional uses of ML to predict activation barriers. Finally, we include a case study in which we use our process to guide the optimization of the dihydrogen activation catalyst. Our approach was able to identify a reaction within 1 kcal/mol of the target barrier by only having to run 12 DFT reaction barrier calculations, which illustrates the usage and real-world applicability of this reformulation for systems of high synthetic importance.
Pages: 13506-13515
Page count: 10
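
The reformulation described in the abstract (finding a reaction in a prespecified set whose barrier matches a target value, using only a few expensive measurements) can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not name the ML search algorithm, so the greedy loop, the inverse-distance-weighted surrogate, the function name `search_for_target`, its parameters, and the synthetic data below are all assumptions made for illustration only.

```python
import random


def search_for_target(features, measure, target, tol=1.0, n_init=3, budget=20, seed=0):
    """Greedy surrogate-guided search over a fixed candidate set (illustrative only).

    features: list of descriptor tuples, one per candidate reaction (hypothetical).
    measure:  callable index -> barrier; stands in for an expensive DFT calculation.
    Returns (best_index, best_value, number_of_measurements).
    """
    rng = random.Random(seed)
    n = len(features)
    # Start from a few randomly chosen measured candidates.
    measured = {i: measure(i) for i in rng.sample(range(n), n_init)}

    def predict(i):
        # Toy surrogate: inverse-distance-weighted mean of the measured barriers.
        num = den = 0.0
        for j, y in measured.items():
            d2 = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            w = 1.0 / (d2 + 1e-9)
            num += w * y
            den += w
        return num / den

    while True:
        best = min(measured, key=lambda i: abs(measured[i] - target))
        # Stop once a measured barrier is within tolerance or the budget is spent.
        if abs(measured[best] - target) <= tol or len(measured) >= min(budget, n):
            return best, measured[best], len(measured)
        unmeasured = [i for i in range(n) if i not in measured]
        # Measure next the candidate whose predicted barrier is closest to the target.
        nxt = min(unmeasured, key=lambda i: abs(predict(i) - target))
        measured[nxt] = measure(nxt)


# Synthetic one-descriptor data set (not real barriers), used only to exercise the loop.
features = [(x / 10.0,) for x in range(50)]
barriers = [10.0 + 3.0 * f[0] for f in features]
idx, value, n_calls = search_for_target(features, lambda i: barriers[i], target=20.0)
```

The point of the sketch is that the surrogate only has to rank candidates inside the given set, so it never needs to generalize beyond that set; in practice the surrogate model and the selection rule would be replaced by whatever the application calls for.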