Constant size descriptors for accurate machine learning models of molecular properties

Cited by: 83
Authors
Collins, Christopher R. [1]
Gordon, Geoffrey J. [2]
von Lilienfeld, O. Anatole [3, 4]
Yaron, David J. [1]
Affiliations
[1] Carnegie Mellon Univ, Dept Chem, 4400 5th Ave, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Machine Learning Dept, Pittsburgh, PA 15213 USA
[3] Univ Basel, Inst Phys Chem, Dept Chem, CH-4056 Basel, Switzerland
[4] Univ Basel, Natl Ctr Computat Design & Discovery Novel Mat MA, CH-4056 Basel, Switzerland
Source
JOURNAL OF CHEMICAL PHYSICS | 2018 / Vol. 148 / Issue 24
Funding
National Science Foundation (USA);
Keywords
ORGANIC PHOTOVOLTAICS; QUANTUM-CHEMISTRY; ENERGIES; POTENTIALS; PREDICTION; SELECTION; DESIGN; KERNEL;
DOI
10.1063/1.5020441
Chinese Library Classification
O64 [Physical chemistry (theoretical chemistry), chemical physics];
Discipline classification codes
070304; 081704;
Abstract
Two different classes of molecular representations for use in machine learning of thermodynamic and electronic properties are studied. The representations are evaluated by monitoring the performance of linear and kernel ridge regression models on well-studied data sets of small organic molecules. One class of representations studied here counts the occurrence of bonding patterns in the molecule. These require only the connectivity of atoms in the molecule as may be obtained from a line diagram or a SMILES string. The second class utilizes the three-dimensional structure of the molecule. These include the Coulomb matrix and Bag of Bonds, which list the inter-atomic distances present in the molecule, and Encoded Bonds, which encode such lists into a feature vector whose length is independent of molecular size. Encoded Bonds' features introduced here have the advantage of leading to models that may be trained on smaller molecules and then used successfully on larger molecules. A wide range of feature sets are constructed by selecting, at each rank, either a graph or geometry-based feature. Here, rank refers to the number of atoms involved in the feature, e.g., atom counts are rank 1, while Encoded Bonds are rank 2. For atomization energies in the QM7 data set, the best graph-based feature set gives a mean absolute error of 3.4 kcal/mol. Inclusion of 3D geometry substantially enhances the performance, with Encoded Bonds giving 2.4 kcal/mol, when used alone, and 1.19 kcal/mol, when combined with graph features. Published by AIP Publishing.
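The idea of turning a variable-length list of inter-atomic distances into a feature vector of fixed length can be illustrated with a minimal Python sketch. The abstract does not specify how Encoded Bonds performs its encoding, so the Gaussian-smeared histogram used here is an assumption chosen for illustration, not the authors' method. The function name `encoded_pair_features`, the pair types, the bin centers, the width, and the approximate geometries are all hypothetical.

```python
import math
from itertools import combinations

# Hypothetical settings (not from the paper): element-pair types and distance bins.
PAIR_TYPES = [("C", "C"), ("C", "H"), ("C", "O"), ("H", "H"), ("H", "O"), ("O", "O")]
CENTERS = [0.5 + 0.25 * k for k in range(12)]  # bin centers in angstrom


def encoded_pair_features(symbols, coords, pair_types=PAIR_TYPES,
                          centers=CENTERS, width=0.2):
    """Return a fixed-length vector: one Gaussian-smeared distance
    histogram per element-pair type (an illustrative assumption)."""
    index = {pair: k for k, pair in enumerate(pair_types)}
    features = [0.0] * (len(pair_types) * len(centers))
    for i, j in combinations(range(len(symbols)), 2):
        pair = tuple(sorted((symbols[i], symbols[j])))
        if pair not in index:
            continue  # skip element pairs outside the chosen vocabulary
        r = math.dist(coords[i], coords[j])
        base = index[pair] * len(centers)
        for b, c in enumerate(centers):
            # each distance contributes a Gaussian weight to every bin
            features[base + b] += math.exp(-((r - c) ** 2) / (2 * width ** 2))
    return features


# Approximate, illustrative geometries (angstrom); not taken from QM7.
water = (["O", "H", "H"],
         [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)])
methane = (["C", "H", "H", "H", "H"],
           [(0.0, 0.0, 0.0), (0.63, 0.63, 0.63), (-0.63, -0.63, 0.63),
            (-0.63, 0.63, -0.63), (0.63, -0.63, -0.63)])

water_vec = encoded_pair_features(*water)
methane_vec = encoded_pair_features(*methane)
print(len(water_vec), len(methane_vec))  # both lengths are equal
```

Because the vector length depends only on the number of pair types and bins, molecules with different atom counts map to inputs of the same size, which is the property that would let a single regression model be trained on smaller molecules and applied to larger ones.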
Pages: 11