Constant size descriptors for accurate machine learning models of molecular properties

Cited by: 83
Authors
Collins, Christopher R. [1]
Gordon, Geoffrey J. [2]
von Lilienfeld, O. Anatole [3, 4]
Yaron, David J. [1]
Affiliations
[1] Carnegie Mellon Univ, Dept Chem, 4400 5th Ave, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Machine Learning Dept, Pittsburgh, PA 15213 USA
[3] Univ Basel, Inst Phys Chem, Dept Chem, CH-4056 Basel, Switzerland
[4] Univ Basel, Natl Ctr Computat Design & Discovery Novel Mat MA, CH-4056 Basel, Switzerland
Source
JOURNAL OF CHEMICAL PHYSICS | 2018 / Vol. 148 / Issue 24
Funding
National Science Foundation (USA);
Keywords
ORGANIC PHOTOVOLTAICS; QUANTUM-CHEMISTRY; ENERGIES; POTENTIALS; PREDICTION; SELECTION; DESIGN; KERNEL;
DOI
10.1063/1.5020441
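Chinese Library Classification number
O64 [Physical chemistry (theoretical chemistry), chemical physics];
Discipline classification code
070304; 081704;
Abstract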
Two different classes of molecular representations for use in machine learning of thermodynamic and electronic properties are studied. The representations are evaluated by monitoring the performance of linear and kernel ridge regression models on well-studied data sets of small organic molecules. One class of representations studied here counts the occurrence of bonding patterns in the molecule. These require only the connectivity of atoms in the molecule as may be obtained from a line diagram or a SMILES string. The second class utilizes the three-dimensional structure of the molecule. These include the Coulomb matrix and Bag of Bonds, which list the inter-atomic distances present in the molecule, and Encoded Bonds, which encode such lists into a feature vector whose length is independent of molecular size. Encoded Bonds' features introduced here have the advantage of leading to models that may be trained on smaller molecules and then used successfully on larger molecules. A wide range of feature sets are constructed by selecting, at each rank, either a graph or geometry-based feature. Here, rank refers to the number of atoms involved in the feature, e.g., atom counts are rank 1, while Encoded Bonds are rank 2. For atomization energies in the QM7 data set, the best graph-based feature set gives a mean absolute error of 3.4 kcal/mol. Inclusion of 3D geometry substantially enhances the performance, with Encoded Bonds giving 2.4 kcal/mol, when used alone, and 1.19 kcal/mol, when combined with graph features. Published by AIP Publishing.
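The abstract describes encoding a list of inter-atomic distances into a feature vector whose length does not depend on molecular size. The sketch below illustrates that general idea only. It is not the authors' Encoded Bonds construction, which the abstract does not specify. The Gaussian smearing onto a fixed grid, the function name `encode_distances`, the parameters `n_bins`, `r_max`, and `sigma`, and the example geometries are all assumptions made for illustration, and element types are ignored.

```python
import math
from itertools import combinations


def encode_distances(coords, n_bins=8, r_max=4.0, sigma=0.5):
    """Map all pairwise distances in a molecule onto a vector of length n_bins.

    Assumption: each distance is smeared as a Gaussian over a fixed grid of
    bin centers, so the output length depends only on n_bins, not on the
    number of atoms.
    """
    centers = [r_max * (k + 0.5) / n_bins for k in range(n_bins)]
    features = [0.0] * n_bins
    for a, b in combinations(coords, 2):
        r = math.dist(a, b)  # Euclidean distance between two atoms
        for k, c in enumerate(centers):
            features[k] += math.exp(-((r - c) ** 2) / (2.0 * sigma ** 2))
    return features


# Hypothetical geometries (angstroms), not taken from QM7.
diatomic = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.1)]
triatomic = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.1), (0.0, 0.0, -1.1)]

v2 = encode_distances(diatomic)
v3 = encode_distances(triatomic)
print(len(v2), len(v3))  # both vectors have n_bins entries
```

Because both vectors have the same length, a single linear or kernel ridge regression model could in principle be fit on small molecules and evaluated on larger ones, which is the property the abstract attributes to constant-size descriptors.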
Pages: 11