Identifying and embedding transferability in data-driven representations of chemical space

Cited by: 4
Authors
Gould, Tim [1]
Chan, Bun [2]
Dale, Stephen G. [1, 3]
Vuckovic, Stefan [4]
Affiliations
[1] Griffith Univ, Queensland Micro & Nanotechnol Ctr, Nathan, Qld 4111, Australia
[2] Nagasaki Univ, Grad Sch Engn, Bunkyo 1-14, Nagasaki 8528521, Japan
[3] Natl Univ Singapore, Inst Funct Intelligent Mat, 4 Sci Dr 2, Singapore 117544, Singapore
[4] Univ Fribourg, Dept Chem, Fribourg, Switzerland
Funding
Japan Society for the Promotion of Science; Swiss National Science Foundation; Australian Research Council;
Keywords
DENSITY-FUNCTIONAL THEORY; EXCHANGE; THERMOCHEMISTRY; APPROXIMATIONS; DFT; AI;
DOI
10.1039/d4sc02358g
Chinese Library Classification (CLC) number
O6 [Chemistry];
Subject classification code
0703;
Abstract
Transferability, especially in the context of model generalization, is a paradigm of all scientific disciplines. However, the rapid advancement of machine learned model development threatens this paradigm, as it can be difficult to understand how transferability is embedded (or missed) in complex models developed using large training data sets. Two related open problems are how to identify, without relying on human intuition, what makes training data transferable; and how to embed transferability into training data. To solve both problems for ab initio chemical modelling, an indispensable tool in everyday chemistry research, we introduce a transferability assessment tool (TAT) and demonstrate it on a controllable data-driven model for developing density functional approximations (DFAs). We reveal that human intuition in the curation of training data introduces chemical biases that can hamper the transferability of data-driven DFAs. We use our TAT to motivate three transferability principles; one of which introduces the key concept of transferable diversity. Finally, we propose data curation strategies for general-purpose machine learning models in chemistry that identify and embed the transferability principles. We show that human intuition in the curation of training data introduces biases that hamper model transferability. We introduce a transferability assessment tool which rigorously measures and subsequently improves transferability.
Pages: 11122-11133
Page count: 12