Data-Efficient, Chemistry-Aware Machine Learning Predictions of Diels-Alder Reaction Outcomes

Cited by: 6
Authors
Keto, Angus [1]
Guo, Taicheng [2]
Underdue, Morgan [3]
Stuyver, Thijs [4, 5]
Coley, Connor W. [4]
Zhang, Xiangliang [2]
Krenske, Elizabeth H. [1]
Wiest, Olaf [3]
Affiliations
[1] Univ Queensland, Sch Chem & Mol Biosci, Brisbane, Qld 4072, Australia
[2] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
[3] Univ Notre Dame, Dept Chem & Biochem, Notre Dame, IN 46556 USA
[4] MIT, Dept Chem Engn, Cambridge, MA 02139 USA
[5] Univ PSL, Ecole Natl Super Chim Paris, Inst Chem Life & Hlth Sci, CNRS, F-75005 Paris, France
Funding
Australian Research Council; U.S. National Science Foundation
Keywords
MOLECULAR-ORBITAL METHODS; REACTION BARRIERS; MODEL;
DOI
10.1021/jacs.4c03131
Chinese Library Classification (CLC) code
O6 [Chemistry]
Subject classification code
0703
Abstract
The application of machine learning models to the prediction of reaction outcomes currently needs large and/or highly featurized data sets. We show that a chemistry-aware model, NERF, which mimics the bonding changes that occur during reactions, allows for highly accurate predictions of the outcomes of Diels-Alder reactions using a relatively small training set, with no pretraining and no additional features. We establish a diverse data set of 9537 intramolecular, hetero-, aromatic, and inverse electron demand Diels-Alder reactions. This data set is used to train a NERF model, and the performance is compared against state-of-the-art classification and generative machine learning models across low- and high-data regimes, with and without pretraining. The predictive accuracy (regio- and site selectivity in the major product) achieved by NERF exceeds 90% when as little as 40% of the data set is used for training. Another high-performing model, Chemformer, requires a larger training data set (>45%) and pretraining to reach 90% Top-1 accuracy. Accurate predictions of less-represented reaction subclasses, such as those involving heteroatomic or aromatic substrates, require higher percentages of training data. We also show how NERF can use small amounts of additional training data to quickly learn new systems and improve its overall understanding of reactivity. Synthetic chemists stand to benefit as this model can be rapidly expanded and tailored to areas of chemistry corresponding to the low-data regime.
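The two quantities the abstract leans on, the training fraction and Top-1 accuracy, can be made concrete with a short sketch. This is only an illustration, not the authors' evaluation code: the function names `training_subset_size` and `top1_accuracy` are hypothetical, and comparing product strings by exact equality is an assumption (a real evaluation would likely canonicalize structures first). The data set size of 9537 and the 40% and 45% fractions come from the abstract.

```python
def training_subset_size(total: int, fraction: float) -> int:
    """Number of reactions in a training split covering `fraction` of the data set."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    return round(total * fraction)


def top1_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of reactions whose top-ranked prediction equals the recorded major product.

    Assumption: products are compared as plain strings; this sketch does no
    structure canonicalization.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


DATASET_SIZE = 9537  # reactions in the data set, per the abstract

# Approximate training-set sizes at the two fractions quoted in the abstract.
n_train_40 = training_subset_size(DATASET_SIZE, 0.40)
n_train_45 = training_subset_size(DATASET_SIZE, 0.45)

# Toy example with placeholder labels (hypothetical, not real reaction data).
toy_predicted = ["product_A", "product_B", "product_C", "product_D"]
toy_reference = ["product_A", "product_B", "product_X", "product_D"]
toy_score = top1_accuracy(toy_predicted, toy_reference)
```

Under these assumptions, a 40% split corresponds to roughly 3815 of the 9537 reactions, which gives a sense of how small the training set behind the 90% figure is.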
Pages: 16052-16061
Number of pages: 10