Prediction of chemical reaction yields with large-scale multi-view pre-training

Cited by: 3
Authors
Shi, Runhan [1 ,2 ]
Yu, Gufeng [1 ,2 ]
Huo, Xiaohong [3 ]
Yang, Yang [1 ,2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, Key Lab Shanghai Educ Commiss Intelligent Interact, Shanghai 200240, Peoples R China
[3] Shanghai Jiao Tong Univ, Shanghai Key Lab Mol Engn & Chiral Drugs, Frontiers Sci Ctr Transformat Mol, Sch Chem & Chem Engn, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Chemical reaction yield prediction; Self-supervised learning; Multi-view; INFORMATION; LANGUAGE; SMILES; MODEL;
DOI
10.1186/s13321-024-00815-2
Chinese Library Classification (CLC)
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Developing machine learning models with high generalization capability for predicting chemical reaction yields is of significant interest. The efficacy of such models depends heavily on the representation of chemical reactions, which has commonly been learned from SMILES strings or molecular graphs using deep neural networks. However, the progression of a chemical reaction is inherently governed by the molecules' 3D geometric properties, which have recently been highlighted as crucial features for accurately predicting molecular properties and chemical reactions. Additionally, large-scale pre-training has been shown to be essential for enhancing the generalization capability of complex deep learning models. Based on these considerations, we propose the Reaction Multi-View Pre-training (ReaMVP) framework, which leverages self-supervised learning techniques and a two-stage pre-training strategy to predict chemical reaction yields. By incorporating multi-view learning with 3D geometric information, ReaMVP achieves state-of-the-art performance on two benchmark datasets. Notably, the experimental results indicate that ReaMVP has a significant advantage in predicting out-of-sample data, suggesting an enhanced generalization ability for new reactions. Scientific contribution: This study presents the ReaMVP framework, which improves the generalization capability of machine learning models for predicting chemical reaction yields. By integrating sequential and geometric views and leveraging self-supervised learning techniques with a two-stage pre-training strategy, ReaMVP achieves state-of-the-art performance on benchmark datasets. The framework demonstrates superior predictive ability for out-of-sample data and enhances the prediction of new reactions.
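The abstract describes two key ingredients: a multi-view representation that aligns a sequential (SMILES) view with a geometric (3D) view of the same reaction, and a two-stage pipeline of self-supervised pre-training followed by supervised yield prediction. The sketch below illustrates the cross-view alignment idea with a symmetric InfoNCE contrastive objective; the encoder architectures, feature dimensions, temperature, and the specific choice of InfoNCE are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of cross-view contrastive pre-training in the spirit of
# ReaMVP: align a sequential (SMILES) view with a geometric (3D) view of
# the same reaction. All architectures and hyperparameters below are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ViewEncoder(nn.Module):
    """Stand-in encoder mapping a precomputed per-view feature vector
    (e.g., pooled SMILES-token or 3D-conformer features) to an embedding."""

    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings


def info_nce(z_seq: torch.Tensor, z_geo: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE: matching (sequence, geometry) pairs are positives;
    all other pairings within the batch serve as negatives."""
    logits = z_seq @ z_geo.t() / tau                            # (B, B) similarities
    targets = torch.arange(z_seq.size(0), device=z_seq.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Toy usage: random tensors stand in for the two views of 16 reactions.
seq_enc, geo_enc = ViewEncoder(in_dim=64), ViewEncoder(in_dim=32)
x_seq, x_geo = torch.randn(16, 64), torch.randn(16, 32)
loss = info_nce(seq_enc(x_seq), geo_enc(x_geo))
loss.backward()
```

In a second stage, the pre-trained encoders would be fine-tuned with a small regression head on labeled reaction yields; only the self-supervised alignment step is sketched here.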
Pages: 16