Constant size descriptors for accurate machine learning models of molecular properties

Cited by: 83
Authors
Collins, Christopher R. [1]
Gordon, Geoffrey J. [2]
von Lilienfeld, O. Anatole [3, 4]
Yaron, David J. [1]
Affiliations
[1] Carnegie Mellon Univ, Dept Chem, 4400 5th Ave, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Machine Learning Dept, Pittsburgh, PA 15213 USA
[3] Univ Basel, Inst Phys Chem, Dept Chem, CH-4056 Basel, Switzerland
[4] Univ Basel, Natl Ctr Computat Design & Discovery Novel Mat MA, CH-4056 Basel, Switzerland
Source
JOURNAL OF CHEMICAL PHYSICS | 2018 / Vol. 148 / Issue 24
Funding
National Science Foundation (USA);
Keywords
ORGANIC PHOTOVOLTAICS; QUANTUM-CHEMISTRY; ENERGIES; POTENTIALS; PREDICTION; SELECTION; DESIGN; KERNEL;
DOI
10.1063/1.5020441
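Chinese Library Classification number
O64 [Physical chemistry (theoretical chemistry), chemical physics];
Discipline classification codes
070304; 081704;
Abstract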
Two different classes of molecular representations for use in machine learning of thermodynamic and electronic properties are studied. The representations are evaluated by monitoring the performance of linear and kernel ridge regression models on well-studied data sets of small organic molecules. One class of representations studied here counts the occurrence of bonding patterns in the molecule. These require only the connectivity of atoms in the molecule as may be obtained from a line diagram or a SMILES string. The second class utilizes the three-dimensional structure of the molecule. These include the Coulomb matrix and Bag of Bonds, which list the inter-atomic distances present in the molecule, and Encoded Bonds, which encode such lists into a feature vector whose length is independent of molecular size. Encoded Bonds' features introduced here have the advantage of leading to models that may be trained on smaller molecules and then used successfully on larger molecules. A wide range of feature sets are constructed by selecting, at each rank, either a graph or geometry-based feature. Here, rank refers to the number of atoms involved in the feature, e.g., atom counts are rank 1, while Encoded Bonds are rank 2. For atomization energies in the QM7 data set, the best graph-based feature set gives a mean absolute error of 3.4 kcal/mol. Inclusion of 3D geometry substantially enhances the performance, with Encoded Bonds giving 2.4 kcal/mol, when used alone, and 1.19 kcal/mol, when combined with graph features. Published by AIP Publishing.
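The idea of a rank-2 feature vector whose length does not depend on molecular size can be illustrated with a minimal sketch. The abstract does not give the encoding used for Encoded Bonds, so the scheme below (Gaussian smearing of each inter-atomic distance onto a fixed distance grid, one block per element pair) is an assumption chosen for illustration, not the authors' method. The function name `encoded_bonds`, the `grid`, the `width` parameter, and the toy geometries are all hypothetical.

```python
import math
from itertools import combinations

def encoded_bonds(symbols, coords, pair_types, grid, width=0.2):
    """Illustrative constant-size rank-2 feature (assumed encoding).

    Every inter-atomic distance is smeared with a Gaussian onto a shared
    distance grid, in the block belonging to its element pair. The output
    length is len(pair_types) * len(grid), whatever the number of atoms.
    """
    block = {pair: i for i, pair in enumerate(pair_types)}
    features = [0.0] * (len(pair_types) * len(grid))
    for (sa, ra), (sb, rb) in combinations(zip(symbols, coords), 2):
        pair = tuple(sorted((sa, sb)))  # element pair, order-independent
        if pair not in block:
            continue  # pair type not tracked by this feature set
        d = math.dist(ra, rb)
        offset = block[pair] * len(grid)
        for k, g in enumerate(grid):
            features[offset + k] += math.exp(-((d - g) ** 2) / (2 * width ** 2))
    return features

# Hypothetical setup: three pair types, 15 grid points from 0.5 to 4.0 angstrom.
pair_types = [("C", "C"), ("C", "H"), ("H", "H")]
grid = [0.5 + 0.25 * k for k in range(15)]

# Toy geometries in angstrom (approximate, for illustration only).
h2_symbols = ["H", "H"]
h2_coords = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)]
ch4_symbols = ["C", "H", "H", "H", "H"]
ch4_coords = [
    (0.0, 0.0, 0.0),
    (0.63, 0.63, 0.63),
    (-0.63, -0.63, 0.63),
    (-0.63, 0.63, -0.63),
    (0.63, -0.63, -0.63),
]

f_h2 = encoded_bonds(h2_symbols, h2_coords, pair_types, grid)
f_ch4 = encoded_bonds(ch4_symbols, ch4_coords, pair_types, grid)
print(len(f_h2), len(f_ch4))  # both vectors have the same length
```

Because both molecules map to vectors of the same length, a linear or kernel ridge regression model fitted on small molecules can be evaluated on larger ones without padding or truncation, which is the property the abstract attributes to Encoded Bonds.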
Pages: 11