ProSE-Pero: Peroxisomal Protein Localization Identification Model Based on Self-Supervised Multi-Task Language Pre-Training Model

Citations: 0
Authors
Sui, Jianan [1]
Chen, Jiazi [2]
Chen, Yuehui [3]
Iwamori, Naoki [2]
Sun, Jin [4]
Affiliations
[1] Univ Jinan, Sch Informat Sci & Engn, Jinan 250022, Shandong, Peoples R China
[2] Kyushu Univ, Grad Sch Bioresource & Bioenvironm Sci, Lab Zool, Fukuoka, Fukuoka 8190395, Japan
[3] Univ Jinan, Inst & Informat Sci & Engn, Sch Artificial Intelligence, Jinan 250022, Shandong, Peoples R China
[4] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Sichuan, Peoples R China
Source
FRONTIERS IN BIOSCIENCE-LANDMARK | 2023 / Vol. 28 / Issue 12
Funding
National Natural Science Foundation of China
Keywords
peroxisomal localization identification; SVMSMOTE; multitasking language model; feature selection; deep learning; vacuole proteins identification; VACUOLAR TRANSPORTERS; ALZHEIMERS-DISEASE; MITOCHONDRIAL; INHIBITION; BIOGENESIS; MECHANISM; STRESS;
DOI
10.31083/j.fbl2812322
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification codes
071010 ; 081704 ;
Abstract
Background: Peroxisomes are membrane-bound organelles that contain one or more types of oxidative enzymes. Aberrant localization of peroxisomal proteins can contribute to the development of various diseases. To identify and localize peroxisomal proteins more accurately, we developed the ProSE-Pero model. Methods: We extracted features of peroxisomal proteins with three deep representation learning models and compared their performance. We then balanced the dataset with SVMSMOTE and used SHAP model interpretation, analysis of variance (ANOVA), and the light gradient boosting machine (LightGBM) to select and compare the extracted features. We also built several traditional machine learning models and four deep learning models, which were trained and tested on a dataset of 160 peroxisomal proteins using tenfold cross-validation. Results: Our proposed ProSE-Pero model achieves a specificity (Sp) of 93.37%, a sensitivity (Sn) of 82.41%, an accuracy (Acc) of 95.77%, a Matthews correlation coefficient (MCC) of 0.8241, an F1 score of 0.8996, and an area under the curve (AUC) of 0.9818. Additionally, we extended our method to identify plant vacuole proteins and achieved an accuracy of 91.90% on the independent test set, which is approximately 5% higher than the latest iPVP-DRLF model. Conclusions: Our model surpasses the existing In-Pero model in peroxisomal protein localization and identification. The study also shows that the pre-trained multitask language model ProSE extracts features from protein sequences effectively. Given its demonstrated validity and generalization, the model has considerable potential for extension to the localization and identification of proteins in other organelles, such as mitochondria and the Golgi apparatus, in future investigations.
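The threshold-based metrics reported in the abstract (Sn, Sp, Acc, MCC, F1) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions in plain Python; the function name `binary_metrics` and the example counts are hypothetical illustrations, not the authors' code or data, and AUC is omitted because it requires ranked scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true/false positives/negatives. Any ratio whose
    denominator is zero is reported as 0.0 (a convention chosen here).
    """
    total = tp + fp + tn + fn
    sn = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0          # specificity
    acc = (tp + tn) / total if total else 0.0          # accuracy
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sn / (precision + sn)) if (precision + sn) else 0.0
    # Matthews correlation coefficient
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc, "F1": f1}


# Hypothetical counts, for illustration only (not taken from the paper).
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because MCC uses all four cells, it is often reported alongside accuracy when classes are imbalanced, which is the setting that SVMSMOTE balancing is meant to address.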
Pages: 14