Molecular contrastive learning of representations via graph neural networks

Cited by: 427
Authors
Wang, Yuyang [1 ,2 ]
Wang, Jianren [3 ]
Cao, Zhonglin [1 ]
Farimani, Amir Barati [1 ,2 ,4 ]
Affiliations
[1] Carnegie Mellon Univ, Dept Mech Engn, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Machine Learning Dept, Pittsburgh, PA 15213 USA
[3] Carnegie Mellon Univ, Robot Inst, Pittsburgh, PA 15213 USA
[4] Carnegie Mellon Univ, Dept Chem Engn, Pittsburgh, PA 15213 USA
Funding
Andrew W. Mellon Foundation, USA;
Keywords
DISCOVERY; DESIGN; ART;
DOI
10.1038/s42256-022-00447-x
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Molecular machine learning bears promise for efficient molecular property prediction and drug discovery. However, labelled molecule data can be expensive and time consuming to acquire. Owing to this scarcity of labelled data, it is a great challenge for supervised machine learning models to generalize to the giant chemical space. Here we present MolCLR (Molecular Contrastive Learning of Representations via Graph Neural Networks), a self-supervised learning framework that leverages large unlabelled data (~10 million unique molecules). In MolCLR pre-training, we build molecule graphs and develop graph-neural-network encoders to learn differentiable representations. Three molecule graph augmentations are proposed: atom masking, bond deletion and subgraph removal. A contrastive estimator maximizes the agreement of augmentations from the same molecule while minimizing the agreement of different molecules. Experiments show that our contrastive learning framework significantly improves the performance of graph-neural-network encoders on various molecular property benchmarks, including both classification and regression tasks. Benefiting from pre-training on the large unlabelled database, MolCLR even achieves state-of-the-art results on several challenging benchmarks after fine-tuning. In addition, further investigations demonstrate that MolCLR learns to embed molecules into representations that can distinguish chemically reasonable molecular similarities.
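The contrastive estimator described in the abstract, which pulls together two augmented views of the same molecule and pushes apart views of different molecules, is commonly realized with an NT-Xent (normalized temperature-scaled cross-entropy) objective of the kind used in SimCLR-style frameworks. The following is a minimal NumPy sketch of that objective, not the authors' implementation; the function name `nt_xent_loss`, the batch shapes, and the temperature value are illustrative assumptions.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent contrastive loss over two batches of embeddings.

    z1[i] and z2[i] are embeddings (e.g. GNN encoder outputs) of two
    augmentations of molecule i; the loss maximizes agreement between
    these positive pairs while treating all other pairs as negatives.
    """
    n = len(z1)
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize -> cosine sim
    sim = z @ z.T / temperature                        # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    # the positive partner of row i is row i+N (and vice versa)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

With identical views the positive pairs dominate the softmax and the loss is small; with unrelated views it grows, which is the signal that drives the encoder during pre-training.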
Pages: 279-287
Number of pages: 9