LMPhosSite: A Deep Learning-Based Approach for General Protein Phosphorylation Site Prediction Using Embeddings from the Local Window Sequence and Pretrained Protein Language Model

Cited by: 12
Authors
Pakhrin, Subash C. [1, 2]
Pokharel, Suresh [3]
Pratyush, Pawel [3]
Chaudhari, Meenal [4]
Ismail, Hamid D. [3]
Dukka, B. K. C. B. [3]
Affiliations
[1] Wichita State Univ, Sch Comp, Wichita, KS 67260 USA
[2] Univ Houston Downtown, Dept Comp Sci & Engn Technol, Houston, TX 77002 USA
[3] Michigan Technol Univ, Dept Comp Sci, Houghton, MI 49931 USA
[4] North Carolina A&T State Univ, Dept Biol, Greensboro, NC 27411 USA
Funding
National Science Foundation (USA);
Keywords
post-translational modification; protein language model; phosphorylation; deep learning; stack generalization; score-level fusion; embedding; RESOURCE; ASSOCIATION; DATABASE;
DOI
10.1021/acs.jproteome.2c00667
Chinese Library Classification
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Phosphorylation is one of the most important post-translational modifications and plays a pivotal role in various cellular processes. Although several computational tools exist to predict phosphorylation sites, they have not yet harnessed the knowledge distilled by pretrained protein language models. Herein, we present a novel deep learning-based approach called LMPhosSite for general phosphorylation site prediction that integrates embeddings from the local window sequence with the contextualized embedding obtained from the global (overall) protein sequence using a pretrained protein language model to improve prediction performance. LMPhosSite thus consists of two base models: one for capturing an effective local representation and the other for capturing the global per-residue contextualized embedding from a pretrained protein language model. The outputs of these base models are integrated using a score-level fusion approach. LMPhosSite achieves a precision, recall, Matthews correlation coefficient, and F1-score of 38.78%, 67.12%, 0.390, and 49.15%, respectively, for the combined serine and threonine independent test data set and 34.90%, 62.03%, 0.298, and 44.67%, respectively, for the tyrosine independent test data set, which is better than the compared approaches. These results demonstrate that LMPhosSite is a robust computational tool for the prediction of general phosphorylation sites in proteins.
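As a reading aid, the sketch below illustrates two ideas from the abstract: combining the scores of two base models at the score level, and relating the reported F1-score to precision and recall. It is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not specify the fusion rule, so the weighted average, the `weight` parameter, and all function names here are hypothetical.

```python
import math


def fuse_scores(local_score, global_score, weight=0.5):
    """Combine two base-model probabilities with a weighted average.

    Assumption: a simple weighted mean stands in for the unspecified
    score-level fusion rule; `weight` is a hypothetical parameter.
    """
    return weight * local_score + (1.0 - weight) * global_score


def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Consistency check against the values reported in the abstract.
f1_ser_thr = f1_from_precision_recall(0.3878, 0.6712)  # serine/threonine
f1_tyr = f1_from_precision_recall(0.3490, 0.6203)      # tyrosine
print(round(f1_ser_thr, 3), round(f1_tyr, 3))
```

The harmonic-mean relation reproduces the reported F1-scores (49.15% and 44.67%) to within rounding, which confirms that the four figures for each data set are listed in the order precision, recall, MCC, F1.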
Pages: 2548-2557
Page count: 10