Extracting Predictive Representations from Hundreds of Millions of Molecules

Cited by: 48
Authors
Chen, Dong [1, 2]
Zheng, Jiaxin [1]
Wei, Guo-Wei [2, 3, 4]
Pan, Feng [1]
Affiliations
[1] Peking Univ, Sch Adv Mat, Shenzhen Grad Sch, Shenzhen 518055, Peoples R China
[2] Michigan State Univ, Dept Math, E Lansing, MI 48824 USA
[3] Michigan State Univ, Dept Biochem & Mol Biol, E Lansing, MI 48824 USA
[4] Michigan State Univ, Dept Elect & Comp Engn, E Lansing, MI 48824 USA
Keywords
DATABASE; FINGERPRINTS; LANGUAGE; PLATFORM; SETS
DOI
10.1021/acs.jpclett.1c03058
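Chinese Library Classification (CLC) number
O64 [Physical chemistry (theoretical chemistry), chemical physics]
Discipline classification codes
070304; 081704
Abstract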
The construction of appropriate representations remains essential for molecular predictions due to intricate molecular complexity. Additionally, it is often expensive and ethically constrained to generate labeled data for supervised learning in molecular sciences, leading to challenging small and diverse data sets. In this work, we develop a self-supervised learning approach to pretrain models from over 700 million unlabeled molecules in multiple databases. The intrinsic chemical logic learned from this approach enables the extraction of predictive representations from task-specific molecular sequences in a fine-tuned process. To understand the importance of self-supervised learning from unlabeled molecules, we assemble three models with different combinations of databases. Moreover, we propose a protocol based on data traits to automatically select the optimal model for a specific task. To validate the proposed method, we consider 10 benchmarks and 38 virtual screening data sets. Extensive validation indicates that the proposed method shows superb performance.
Pages: 10793-10801
Page count: 9