Porter 6: Protein Secondary Structure Prediction by Leveraging Pre-Trained Language Models (PLMs)

Cited by: 3
Authors
Alanazi, Wafa [1, 2]
Meng, Di [1]
Pollastri, Gianluca [1]
Affiliations
[1] Univ Coll Dublin UCD, Sch Comp Sci, Dublin D04V1W8, Ireland
[2] Northern Border Univ, Coll Sci, Dept Comp Sci, POB 2014, Ar Ar, Saudi Arabia
Funding
Science Foundation Ireland
Keywords
protein structure prediction; structural bioinformatics; bioinformatics; natural language processing; computational biology; deep learning;
DOI
10.3390/ijms26010130
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Accurate protein secondary structure prediction (PSSP) is crucial for understanding protein function, which is foundational to advances in drug development, disease treatment, and biotechnology. By predicting protein secondary structures, researchers gain critical insights into protein folding and function within cells. The advent of deep learning models, capable of processing complex sequence data and identifying meaningful patterns, offers substantial potential to enhance the accuracy and efficiency of protein structure prediction. In particular, recent breakthroughs in deep learning, driven by the integration of natural language processing (NLP) algorithms, have significantly advanced the field of protein research. Inspired by the success of NLP techniques, this study uses pre-trained language models (PLMs) to advance PSSP. We conduct a comprehensive evaluation of various deep learning models trained on distinct sequence embeddings, including one-hot encoding and PLM-based approaches such as ProtTrans and ESM-2, to develop a prediction system optimized for accuracy and computational efficiency. Our proposed model, Porter 6, is an ensemble of CBRNN-based predictors that uses embeddings from the protein language model ESM-2 as input features. Porter 6 performs well on large-scale, independent test sets. On a 2022 test set, the model attains 86.60% accuracy in three-state (Q3) and 76.43% in eight-state (Q8) classification. On a more recent 2024 test set, Porter 6 maintains robust performance, achieving 84.56% in Q3 and 74.18% in Q8. This represents a significant 3% improvement over its predecessor, and Porter 6 outperforms or matches state-of-the-art approaches in the field.
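For readers unfamiliar with the Q3 and Q8 figures quoted in the abstract, the following is a minimal sketch of how per-residue accuracy is typically computed. It is not the authors' evaluation code. The 8-to-3 state reduction shown here (H, G, I to helix; E, B to strand; everything else to coil) is a common convention and is an assumption, since this record does not state which mapping Porter 6 uses. The function names and the use of "C" for the coil class are likewise illustrative.

```python
# Assumed DSSP 8-state to 3-state reduction (a common convention; the mapping
# actually used by Porter 6 is not stated in this record).
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",  # helix classes
    "E": "E", "B": "E",            # strand classes
    "T": "C", "S": "C", "C": "C",  # remaining classes treated as coil
}


def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted label equals the observed label."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def to_three_state(labels: str) -> str:
    """Collapse an 8-state label string to 3 states using the assumed mapping."""
    return "".join(DSSP8_TO_3[s] for s in labels)


def q8_and_q3(predicted8: str, observed8: str) -> tuple[float, float]:
    """Return (Q8, Q3) for one protein, deriving Q3 by reducing both strings."""
    q8 = per_residue_accuracy(predicted8, observed8)
    q3 = per_residue_accuracy(to_three_state(predicted8), to_three_state(observed8))
    return q8, q3


if __name__ == "__main__":
    # Toy 10-residue example (made-up labels, not data from the paper).
    q8, q3 = q8_and_q3("HHHHEEETCC", "HHHGEEBTSC")
    print(f"Q8 = {q8:.1f}%, Q3 = {q3:.1f}%")
```

In this toy example three residues are wrong at the 8-state level but fall into the correct 3-state class, which illustrates why Q3 is generally higher than Q8. A predictor may also emit 3-state labels directly instead of reducing its 8-state output; the sketch assumes the reduction route only for simplicity.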
Pages: 16