Identifying Protein-Nucleotide Binding Residues via Grouped Multi-task Learning and Pre-trained Protein Language Models

Cited: 0
Authors
Wu, Jiashun [1 ]
Liu, Yan [2 ]
Zhang, Ying [1 ]
Wang, Xiaoyu [3 ,4 ]
Yan, He [5 ]
Zhu, Yiheng [6 ]
Song, Jiangning [3 ,4 ,7 ]
Yu, Dong-Jun [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
[2] Yangzhou Univ, Sch Informat Engn, Yangzhou 225100, Peoples R China
[3] Monash Univ, Monash Biomed Discovery Inst, Melbourne, Vic 3800, Australia
[4] Monash Univ, Dept Biochem & Mol Biol, Melbourne, Vic 3800, Australia
[5] Nanjing Forestry Univ, Coll Informat Sci & Technol & Artificial Intelligence, Nanjing 210037, Peoples R China
[6] Nanjing Agr Univ, Coll Artificial Intelligence, Nanjing 210095, Peoples R China
[7] Monash Univ, Monash Data Futures Inst, Melbourne, Vic 3800, Australia
Funding
National Natural Science Foundation of China;
Keywords
PREDICTION; SEQUENCE; SITES;
DOI
10.1021/acs.jcim.4c02092
Chinese Library Classification (CLC)
R914 [Medicinal Chemistry];
Discipline Code
100701;
Abstract
The accurate identification of protein-nucleotide binding residues is crucial for protein function annotation and drug discovery. Numerous computational methods have been proposed to predict these binding residues, achieving remarkable performance. However, due to the limited availability and high variability of nucleotides, predicting binding residues for diverse nucleotides remains a significant challenge. To address these challenges, we propose NucGMTL, a new grouped deep multi-task learning approach designed to predict the binding residues of all nucleotides observed in the BioLiP database. NucGMTL leverages pre-trained protein language models to generate robust sequence embeddings and incorporates multi-scale learning along with scale-based self-attention mechanisms to capture a broader range of feature dependencies. To effectively harness the shared binding patterns across various nucleotides, deep multi-task learning is utilized to distill common representations, taking advantage of auxiliary information from similar nucleotides selected through task grouping. Performance evaluation on benchmark data sets shows that NucGMTL achieves an average area under the precision-recall curve (AUPRC) of 0.594, surpassing other state-of-the-art methods. Further analyses highlight that the predominant advantage of NucGMTL lies in its effective integration of grouped multi-task learning and pre-trained protein language models. The data set and source code are freely accessible at https://github.com/jerry1984Y/NucGMTL.
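The record does not detail the architecture, so the snippet below is only a minimal, hypothetical sketch of the grouped multi-task idea summarized in the abstract: per-residue embeddings from a pre-trained protein language model (ESM-2 is assumed here) pass through a shared trunk, and each nucleotide group produced by task grouping gets its own binding-residue head; the AUPRC metric reported above is computed with scikit-learn. All layer sizes, group assignments, and names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of grouped multi-task prediction over shared PLM embeddings.
# Not the NucGMTL code; dimensions and grouping are assumed for illustration.
import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score


class GroupedMultiTaskHead(nn.Module):
    def __init__(self, embed_dim: int, n_groups: int, hidden_dim: int = 256):
        super().__init__()
        # Shared trunk: distills representations common to all nucleotide tasks.
        self.shared = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU())
        # One per-residue binary classifier per nucleotide group.
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(n_groups)])

    def forward(self, plm_embeddings: torch.Tensor, group_id: int) -> torch.Tensor:
        # plm_embeddings: (sequence_length, embed_dim) per-residue features from a
        # frozen protein language model (e.g., ESM-2, whose embedding size is 1280).
        h = self.shared(plm_embeddings)
        return self.heads[group_id](h).squeeze(-1)  # per-residue binding logits


if __name__ == "__main__":
    torch.manual_seed(0)
    model = GroupedMultiTaskHead(embed_dim=1280, n_groups=3)
    fake_embeddings = torch.randn(120, 1280)        # toy 120-residue protein
    logits = model(fake_embeddings, group_id=0)     # one assumed nucleotide group
    labels = torch.zeros(120)
    labels[40:45] = 1.0                             # toy binding-site annotations
    # AUPRC, the evaluation metric reported in the abstract.
    auprc = average_precision_score(labels.numpy(),
                                    torch.sigmoid(logits).detach().numpy())
    print(f"toy AUPRC: {auprc:.3f}")
```

In this toy run the AUPRC stays near the positive-class prevalence, as expected for an untrained model; the 0.594 figure cited above refers to the authors' trained model on their benchmark data sets.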
Pages: 1040-1052
Number of pages: 13
Related Papers (11 items)
  • [1] Integration of pre-trained protein language models into geometric deep learning networks
    Wu, Fang
    Wu, Lirong
    Radev, Dragomir
    Xu, Jinbo
    Li, Stan Z.
    COMMUNICATIONS BIOLOGY, 2023, 6 (01)
  • [2] PDNAPred: Interpretable prediction of protein-DNA binding sites based on pre-trained protein language models
    Zhang, Lingrong
    Liu, Taigang
    INTERNATIONAL JOURNAL OF BIOLOGICAL MACROMOLECULES, 2024, 281
  • [3] Protein-small molecule binding site prediction based on a pre-trained protein language model with contrastive learning
    Wang, Jue
    Liu, Yufan
    Tian, Boxue
JOURNAL OF CHEMINFORMATICS, 2024, 16 (01)
  • [4] Identification of protein-nucleotide binding residues via graph regularized k-local hyperplane distance nearest neighbor model
    Ding, Yijie
    Yang, Chao
    Tang, Jijun
    Guo, Fei
    APPLIED INTELLIGENCE, 2022, 52 (06) : 6598 - 6612
  • [5] PepPFN: protein-peptide binding residues prediction via pre-trained module-based Fourier Network
    Li, Xue
    Cao, Ben
    Ding, Hongzhen
    Kang, Na
    Song, Tao
    2024 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI 2024, 2024, : 1075 - 1080
  • [6] UniproLcad: Accurate Identification of Antimicrobial Peptide by Fusing Multiple Pre-Trained Protein Language Models
    Wang, Xiao
    Wu, Zhou
    Wang, Rong
    Gao, Xu
SYMMETRY-BASEL, 2024, 16 (04)
  • [7] Recent advances in features generation for membrane protein sequences: From multiple sequence alignment to pre-trained language models
    Ou, Yu-Yen
    Ho, Quang-Thai
    Chang, Heng-Ta
    PROTEOMICS, 2023, 23 (23-24)
  • [8] Granular multiple kernel learning for identifying RNA-binding protein residues via integrating sequence and structure information
    Yang, Chao
    Ding, Yijie
    Meng, Qiaozhen
    Tang, Jijun
    Guo, Fei
    NEURAL COMPUTING & APPLICATIONS, 2021, 33 (17) : 11387 - 11399
  • [9] VesiMCNN: Using pre-trained protein language models and multiple window scanning convolutional neural networks to identify vesicular transport proteins
    Le, Van The
    Tseng, Yi-Hsuan
    Liu, Yu-Chen
    Malik, Muhammad Shahid
    Ou, Yu-Yen
    INTERNATIONAL JOURNAL OF BIOLOGICAL MACROMOLECULES, 2024, 280
  • [10] CasPro-ESM2: Accurate identification of Cas proteins integrating pre-trained protein language model and multi-scale convolutional neural network
    Yan, Chaorui
    Zhang, Zilong
    Xu, Junlin
    Meng, Yajie
    Yan, Shankai
    Wei, Leyi
    Zou, Quan
    Zhang, Qingchen
    Cui, Feifei
    INTERNATIONAL JOURNAL OF BIOLOGICAL MACROMOLECULES, 2025, 308