Extracting Predictive Representations from Hundreds of Millions of Molecules

Cited by: 40
Authors
Chen, Dong [1, 2]
Zheng, Jiaxin [1]
Wei, Guo-Wei [2, 3, 4]
Pan, Feng [1]
Affiliations
[1] Peking Univ, Sch Adv Mat, Shenzhen Grad Sch, Shenzhen 518055, Peoples R China
[2] Michigan State Univ, Dept Math, E Lansing, MI 48824 USA
[3] Michigan State Univ, Dept Biochem & Mol Biol, E Lansing, MI 48824 USA
[4] Michigan State Univ, Dept Elect & Comp Engn, E Lansing, MI 48824 USA
Source
JOURNAL OF PHYSICAL CHEMISTRY LETTERS | 2021 / Vol. 12 / Issue 44
Keywords
DATABASE; FINGERPRINTS; LANGUAGE; PLATFORM; SETS;
DOI
10.1021/acs.jpclett.1c03058
Chinese Library Classification number
O64 [Physical chemistry (theoretical chemistry), chemical physics]
Discipline classification codes
070304; 081704
Abstract
The construction of appropriate representations remains essential for molecular predictions due to intricate molecular complexity. Additionally, it is often expensive and ethically constrained to generate labeled data for supervised learning in molecular sciences, leading to challenging small and diverse data sets. In this work, we develop a self-supervised learning approach to pretrain models from over 700 million unlabeled molecules in multiple databases. The intrinsic chemical logic learned from this approach enables the extraction of predictive representations from task-specific molecular sequences in a fine-tuned process. To understand the importance of self-supervised learning from unlabeled molecules, we assemble three models with different combinations of databases. Moreover, we propose a protocol based on data traits to automatically select the optimal model for a specific task. To validate the proposed method, we consider 10 benchmarks and 38 virtual screening data sets. Extensive validation indicates that the proposed method shows superb performance.
Pages: 10793-10801
Page count: 9