Identifying Protein-Nucleotide Binding Residues via Grouped Multi-task Learning and Pre-trained Protein Language Models

Cited: 0
Authors
Wu, Jiashun [1 ]
Liu, Yan [2 ]
Zhang, Ying [1 ]
Wang, Xiaoyu [3 ,4 ]
Yan, He [5 ]
Zhu, Yiheng [6 ]
Song, Jiangning [3 ,4 ,7 ]
Yu, Dong-Jun [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
[2] Yangzhou Univ, Sch Informat Engn, Yangzhou 225100, Peoples R China
[3] Monash Univ, Monash Biomed Discovery Inst, Melbourne, Vic 3800, Australia
[4] Monash Univ, Dept Biochem & Mol Biol, Melbourne, Vic 3800, Australia
[5] Nanjing Forestry Univ, Coll Informat Sci & Technol & Artificial Intelligence, Nanjing 210037, Peoples R China
[6] Nanjing Agr Univ, Coll Artificial Intelligence, Nanjing 210095, Peoples R China
[7] Monash Univ, Monash Data Futures Inst, Melbourne, Vic 3800, Australia
Funding
National Natural Science Foundation of China;
Keywords
PREDICTION; SEQUENCE; SITES;
DOI
10.1021/acs.jcim.4c02092
Chinese Library Classification (CLC)
R914 [Medicinal Chemistry];
Discipline Code
100701;
Abstract
The accurate identification of protein-nucleotide binding residues is crucial for protein function annotation and drug discovery. Numerous computational methods have been proposed to predict these binding residues, achieving remarkable performance. However, due to the limited availability and high variability of nucleotides, predicting binding residues for diverse nucleotides remains a significant challenge. To address these challenges, we propose NucGMTL, a new grouped deep multi-task learning approach designed to predict the binding residues of all nucleotides observed in the BioLiP database. NucGMTL leverages pre-trained protein language models to generate robust sequence embeddings and incorporates multi-scale learning along with scale-based self-attention mechanisms to capture a broader range of feature dependencies. To effectively harness the shared binding patterns across various nucleotides, deep multi-task learning is utilized to distill common representations, taking advantage of auxiliary information from similar nucleotides selected through task grouping. Performance evaluation on benchmark data sets shows that NucGMTL achieves an average area under the precision-recall curve (AUPRC) of 0.594, surpassing other state-of-the-art methods. Further analyses highlight that the predominant advantage of NucGMTL lies in its effective integration of grouped multi-task learning and pre-trained protein language models. The data set and source code are freely accessible at https://github.com/jerry1984Y/NucGMTL.
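The record does not detail the architecture, so the snippet below is only a minimal, hypothetical sketch of the grouped multi-task idea summarized in the abstract: per-residue embeddings from a pre-trained protein language model (ESM-2 is assumed here) pass through a shared trunk, and each nucleotide group produced by task grouping gets its own binding-residue head; the AUPRC metric reported above is computed with scikit-learn. All layer sizes, group assignments, and names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of grouped multi-task prediction over shared PLM embeddings.
# Not the NucGMTL code; dimensions and grouping are assumed for illustration.
import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score


class GroupedMultiTaskHead(nn.Module):
    def __init__(self, embed_dim: int, n_groups: int, hidden_dim: int = 256):
        super().__init__()
        # Shared trunk: distills representations common to all nucleotide tasks.
        self.shared = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU())
        # One per-residue binary classifier per nucleotide group.
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(n_groups)])

    def forward(self, plm_embeddings: torch.Tensor, group_id: int) -> torch.Tensor:
        # plm_embeddings: (sequence_length, embed_dim) per-residue features from a
        # frozen protein language model (e.g., ESM-2, whose embedding size is 1280).
        h = self.shared(plm_embeddings)
        return self.heads[group_id](h).squeeze(-1)  # per-residue binding logits


if __name__ == "__main__":
    torch.manual_seed(0)
    model = GroupedMultiTaskHead(embed_dim=1280, n_groups=3)
    fake_embeddings = torch.randn(120, 1280)        # toy 120-residue protein
    logits = model(fake_embeddings, group_id=0)     # one assumed nucleotide group
    labels = torch.zeros(120)
    labels[40:45] = 1.0                             # toy binding-site annotations
    # AUPRC, the evaluation metric reported in the abstract.
    auprc = average_precision_score(labels.numpy(),
                                    torch.sigmoid(logits).detach().numpy())
    print(f"toy AUPRC: {auprc:.3f}")
```

In this toy run the AUPRC stays near the positive-class prevalence, as expected for an untrained model; the 0.594 figure cited above refers to the authors' trained model on their benchmark data sets.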
Pages: 1040-1052
Number of pages: 13
Related Papers (11 items)
  • [1] Integration of pre-trained protein language models into geometric deep learning networks
    Wu, Fang
    Wu, Lirong
    Radev, Dragomir
    Xu, Jinbo
    Li, Stan Z.
    COMMUNICATIONS BIOLOGY, 2023, 6 (01)
  • [2] PDNAPred: Interpretable prediction of protein-DNA binding sites based on pre-trained protein language models
    Zhang, Lingrong
    Liu, Taigang
    INTERNATIONAL JOURNAL OF BIOLOGICAL MACROMOLECULES, 2024, 281
  • [3] Protein-small molecule binding site prediction based on a pre-trained protein language model with contrastive learning
    Wang, Jue
    Liu, Yufan
    Tian, Boxue
JOURNAL OF CHEMINFORMATICS, 2024, 16 (01)
  • [4] Identification of protein-nucleotide binding residues via graph regularized k-local hyperplane distance nearest neighbor model
    Ding, Yijie
    Yang, Chao
    Tang, Jijun
    Guo, Fei
    APPLIED INTELLIGENCE, 2022, 52 (06) : 6598 - 6612
  • [5] PepPFN: protein-peptide binding residues prediction via pre-trained module-based Fourier Network
    Li, Xue
    Cao, Ben
    Ding, Hongzhen
    Kang, Na
    Song, Tao
    2024 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI 2024, 2024, : 1075 - 1080
  • [6] UniproLcad: Accurate Identification of Antimicrobial Peptide by Fusing Multiple Pre-Trained Protein Language Models
    Wang, Xiao
    Wu, Zhou
    Wang, Rong
    Gao, Xu
SYMMETRY-BASEL, 2024, 16 (04)
  • [7] Recent advances in features generation for membrane protein sequences: From multiple sequence alignment to pre-trained language models
    Ou, Yu-Yen
    Ho, Quang-Thai
    Chang, Heng-Ta
    PROTEOMICS, 2023, 23 (23-24)
  • [8] Granular multiple kernel learning for identifying RNA-binding protein residues via integrating sequence and structure information
    Yang, Chao
    Ding, Yijie
    Meng, Qiaozhen
    Tang, Jijun
    Guo, Fei
    NEURAL COMPUTING & APPLICATIONS, 2021, 33 (17) : 11387 - 11399
  • [9] VesiMCNN: Using pre-trained protein language models and multiple window scanning convolutional neural networks to identify vesicular transport proteins
    Le, Van The
    Tseng, Yi-Hsuan
    Liu, Yu-Chen
    Malik, Muhammad Shahid
    Ou, Yu-Yen
    INTERNATIONAL JOURNAL OF BIOLOGICAL MACROMOLECULES, 2024, 280
  • [10] CasPro-ESM2: Accurate identification of Cas proteins integrating pre-trained protein language model and multi-scale convolutional neural network
    Yan, Chaorui
    Zhang, Zilong
    Xu, Junlin
    Meng, Yajie
    Yan, Shankai
    Wei, Leyi
    Zou, Quan
    Zhang, Qingchen
    Cui, Feifei
    INTERNATIONAL JOURNAL OF BIOLOGICAL MACROMOLECULES, 2025, 308