LMNglyPred: prediction of human N-linked glycosylation sites using embeddings from a pre-trained protein language model

Cited by: 14
Authors
Pakhrin, Subash C. [1 ,2 ]
Pokharel, Suresh [3 ]
Aoki-Kinoshita, Kiyoko F. [4 ]
Beck, Moriah R. [5 ]
Dam, Tarun K. [6 ]
Caragea, Doina [7 ]
Kc, Dukka B. [3 ]
Affiliations
[1] Wichita State Univ, Sch Comp, 1845 Fairmount St, Wichita, KS 67260 USA
[2] Univ Houston Downtown, Dept Comp Sci & Engn Technol, Houston, TX 77002 USA
[3] Michigan Technol Univ, Coll Comp, Dept Comp Sci, Houghton, MI 49931 USA
[4] Soka Univ, Glycan & Life Syst Integrat Ctr GaLSIC, Tokyo 1928577, Japan
[5] Wichita State Univ, Dept Chem & Biochem, 1845 Fairmount St, Wichita, KS 67260 USA
[6] Kansas State Univ, Dept Chem, Lab Mechanist Glycobiol, Manhattan, KS 66506 USA
[7] Kansas State Univ, Dept Comp Sci, Manhattan, KS 66506 USA
Funding
U.S. National Science Foundation
Keywords
deep learning; N-linked glycosylation; post-translation modification; prediction; protein language model; SEQUENCE; BACTERIAL; SETS;
DOI
10.1093/glycob/cwad033
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification number
071010 ; 081704 ;
Abstract
Protein N-linked glycosylation is an important post-translational modification in Homo sapiens that plays essential roles in many vital biological processes. It occurs at the N-X-[S/T] sequon in amino acid sequences, where X can be any amino acid except proline. However, not all N-X-[S/T] sequons are glycosylated; the sequon is therefore a necessary but not sufficient determinant of protein glycosylation. Computational prediction of N-linked glycosylation sites confined to N-X-[S/T] sequons is thus an important problem that existing methods have not extensively addressed, especially with regard to the creation of negative sets and the use of distilled information from protein language models (pLMs). Here, we developed LMNglyPred, a deep learning-based approach that predicts N-linked glycosylation sites in human proteins using embeddings from a pre-trained pLM. On a benchmark-independent test set, LMNglyPred achieves a sensitivity of 76.50%, a specificity of 75.36%, a Matthews correlation coefficient of 0.49, a precision of 60.99%, and an accuracy of 75.74%. These results indicate that LMNglyPred is a robust computational tool for predicting N-linked glycosylation sites confined to the N-X-[S/T] sequon.
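The sequon rule stated in the abstract (N, then any residue except proline, then S or T) can be illustrated with a short scan over a protein sequence. This is a minimal sketch, assuming sequences are plain one-letter amino acid strings; `find_sequons` is a hypothetical helper written for illustration and is not taken from LMNglyPred. It only enumerates candidate sites, which, as the abstract notes, are necessary but not sufficient for glycosylation.

```python
import re

# Zero-width lookahead so that overlapping sequons (e.g. "NNSS") are all found.
# N, then any residue except proline (P), then serine (S) or threonine (T).
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of the asparagine in each N-X-[S/T] sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    # Toy sequence, not a real protein: "NPS" is skipped because X is proline.
    print(find_sequons("MNGTANPSNNSS"))
```

A predictor of the kind described here would then score each candidate position, for example from a per-residue pLM embedding, to separate glycosylated from non-glycosylated sequons.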
Pages: 411-422
Page count: 12