SMICLR: Contrastive Learning on Multiple Molecular Representations for Semisupervised and Unsupervised Representation Learning

Cited by: 23
Authors
Pinheiro, Gabriel A. [1]
Silva, Juarez L. F. [2]
Quiles, Marcos G. [1]
Affiliations
[1] Fed Univ Sao Paulo Unifesp, Inst Sci & Technol, BR-12247014 Sao Jose Dos Campos, SP, Brazil
[2] Univ Sao Paulo, Sao Carlos Inst Chem, BR-13560970 Sao Carlos, SP, Brazil
Funding
São Paulo Research Foundation, Brazil;
Keywords
PREDICTION; NETWORKS; LANGUAGE; MODELS; SMILES;
DOI
10.1021/acs.jcim.2c00521
Chinese Library Classification
R914 [Medicinal Chemistry];
Discipline classification code
100701;
Abstract
Machine learning as a tool for chemical space exploration broadens the horizons for working with known and unknown molecules. At its core lies molecular representation, which is key to improving the learning of structure-property relationships. Recently, contrastive frameworks have shown impressive results for representation learning in diverse domains. Therefore, this paper proposes a contrastive framework that embraces multimodal molecular data. Specifically, our approach jointly trains a graph encoder and an encoder for the simplified molecular-input line-entry system (SMILES) string to optimize the contrastive learning objective. Since SMILES is the basis of our method, i.e., we build the molecular graph from the SMILES string, we call our framework SMILES Contrastive Learning (SMICLR). When stacking a nonlinear regressor on SMICLR's pretrained encoder and fine-tuning the entire model, we reduced the prediction error by, on average, 44% and 25% for the energetic and electronic properties of the QM9 data set, respectively, relative to the supervised baseline. We further improved our framework's performance by applying data augmentations to each molecular-input representation. Moreover, SMICLR demonstrated competitive representation learning results in an unsupervised setting.
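The abstract does not state the exact loss, so the following is only a minimal sketch of what a contrastive objective between two encoders could look like. It assumes a symmetric, temperature-scaled cross-entropy over cosine similarities (an InfoNCE-style loss) in which row i of the graph embeddings and row i of the SMILES embeddings come from the same molecule. The function name `cross_modal_contrastive_loss`, the `temperature` value, and the plain-list embeddings are illustrative assumptions, not the authors' implementation.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _matched_pair_loss(logits):
    """Mean of -log softmax(row i)[i]: each row should pick its own column."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def cross_modal_contrastive_loss(z_graph, z_smiles, temperature=0.1):
    """Hypothetical symmetric contrastive loss between two views of a batch.

    z_graph[i] and z_smiles[i] are assumed to embed the same molecule;
    every other pairing in the batch is treated as a negative.
    """
    zg = [_normalize(v) for v in z_graph]
    zs = [_normalize(v) for v in z_smiles]
    # logits[i][j]: scaled cosine similarity of graph i and SMILES j
    logits = [
        [sum(a * b for a, b in zip(g, s)) / temperature for s in zs]
        for g in zg
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the graph-to-SMILES and SMILES-to-graph directions
    return 0.5 * (_matched_pair_loss(logits) + _matched_pair_loss(logits_t))
```

Under these assumptions the loss is small when each molecule's two embeddings agree and large when they are mismatched. Passing the same orthogonal vectors for both views gives a value near zero, while shifting one view by one position in the batch gives a much larger value.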
Pages: 3948-3960
Page count: 13