Prediction of chemical reaction yields with large-scale multi-view pre-training

Cited by: 3
Authors
Shi, Runhan [1 ,2 ]
Yu, Gufeng [1 ,2 ]
Huo, Xiaohong [3 ]
Yang, Yang [1 ,2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, Key Lab Shanghai Educ Commiss Intelligent Interact, Shanghai 200240, Peoples R China
[3] Shanghai Jiao Tong Univ, Shanghai Key Lab Mol Engn & Chiral Drugs, Frontiers Sci Ctr Transformat Mol, Sch Chem & Chem Engn, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Chemical reaction yield prediction; Self-supervised learning; Multi-view; INFORMATION; LANGUAGE; SMILES; MODEL;
DOI
10.1186/s13321-024-00815-2
Chinese Library Classification (CLC)
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Developing machine learning models with high generalization capability for predicting chemical reaction yields is of significant interest. The efficacy of such models depends heavily on the representation of chemical reactions, which has commonly been learned from SMILES strings or molecular graphs using deep neural networks. However, the progression of a chemical reaction is inherently governed by the molecules' 3D geometric properties, which have recently been highlighted as crucial features for accurately predicting molecular properties and chemical reactions. Additionally, large-scale pre-training has been shown to be essential for enhancing the generalization capability of complex deep learning models. Based on these considerations, we propose the Reaction Multi-View Pre-training (ReaMVP) framework, which leverages self-supervised learning techniques and a two-stage pre-training strategy to predict chemical reaction yields. By incorporating multi-view learning with 3D geometric information, ReaMVP achieves state-of-the-art performance on two benchmark datasets. Notably, the experimental results indicate that ReaMVP has a significant advantage in predicting out-of-sample data, suggesting an enhanced generalization ability for new reactions. Scientific contribution: This study presents the ReaMVP framework, which improves the generalization capability of machine learning models for predicting chemical reaction yields. By integrating sequential and geometric views and leveraging self-supervised learning techniques with a two-stage pre-training strategy, ReaMVP achieves state-of-the-art performance on benchmark datasets. The framework demonstrates superior predictive ability for out-of-sample data and enhances the prediction of new reactions.
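The abstract describes two key ingredients: a multi-view representation that aligns a sequential (SMILES) view with a geometric (3D) view of the same reaction, and a two-stage pipeline of self-supervised pre-training followed by supervised yield prediction. The sketch below illustrates the cross-view alignment idea with a symmetric InfoNCE contrastive objective; the encoder architectures, feature dimensions, temperature, and the specific choice of InfoNCE are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of cross-view contrastive pre-training in the spirit of
# ReaMVP: align a sequential (SMILES) view with a geometric (3D) view of
# the same reaction. All architectures and hyperparameters below are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ViewEncoder(nn.Module):
    """Stand-in encoder mapping a precomputed per-view feature vector
    (e.g., pooled SMILES-token or 3D-conformer features) to an embedding."""

    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings


def info_nce(z_seq: torch.Tensor, z_geo: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE: matching (sequence, geometry) pairs are positives;
    all other pairings within the batch serve as negatives."""
    logits = z_seq @ z_geo.t() / tau                            # (B, B) similarities
    targets = torch.arange(z_seq.size(0), device=z_seq.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Toy usage: random tensors stand in for the two views of 16 reactions.
seq_enc, geo_enc = ViewEncoder(in_dim=64), ViewEncoder(in_dim=32)
x_seq, x_geo = torch.randn(16, 64), torch.randn(16, 32)
loss = info_nce(seq_enc(x_seq), geo_enc(x_geo))
loss.backward()
```

In a second stage, the pre-trained encoders would be fine-tuned with a small regression head on labeled reaction yields; only the self-supervised alignment step is sketched here.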
Pages: 16