Benchmarking protein language models for protein crystallization

Cited by: 0
Authors
Mall, Raghvendra [1 ]
Kaushik, Rahul [1 ]
Martinez, Zachary A. [2 ]
Thomson, Matt W. [2 ]
Castiglione, Filippo [1 ,3 ]
Affiliations
[1] Technol Innovat Inst, Biotechnol Res Ctr, POB 9639, Abu Dhabi, U Arab Emirates
[2] CALTECH, Div Biol & Bioengn, Pasadena, CA 91125 USA
[3] Natl Res Council Italy, Inst Appl Comp, I-00185 Rome, Italy
Source
SCIENTIFIC REPORTS | 2025, Vol. 15, No. 1
Keywords
Open protein language models (PLMs); Protein crystallization; Benchmarking; Protein generation; PROPENSITY PREDICTION; REFINEMENT;
DOI
10.1038/s41598-025-86519-5
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biosciences]; N [General Natural Sciences];
Discipline codes
07; 0710; 09;
Abstract
The problem of protein structure determination is usually solved by X-ray crystallography. Several in silico deep learning methods have been developed to predict the crystallization propensities of proteins from their sequences, in order to overcome the high attrition rate, cost, and extensive trial-and-error of crystallization experiments. In this work, we benchmark the power of open protein language models (PLMs) through the TRILL platform, a bespoke framework democratizing the usage of PLMs, on the task of predicting crystallization propensities of proteins. By comparing LightGBM/XGBoost classifiers built on the average embedding representations of proteins learned by different PLMs (ESM2, Ankh, ProtT5-XL, ProstT5, xTrimoPGLM, SaProt) with the performance of state-of-the-art sequence-based methods such as DeepCrystal, ATTCrys, and CLPred, we identify the most effective methods for predicting crystallization outcomes. The LightGBM classifiers utilizing embeddings from the ESM2 models with 30 and 36 transformer layers (150 million and 3 billion parameters, respectively) achieve performance gains of 3-5% over all compared models across evaluation metrics, including AUPR (Area Under the Precision-Recall Curve), AUC (Area Under the Receiver Operating Characteristic Curve), and F1 on independent test sets. Furthermore, we fine-tune the ProtGPT2 model available via TRILL to generate crystallizable proteins.
Starting with 3000 generated proteins and applying a series of filtration steps (consensus of all open PLM-based classifiers, sequence-identity clustering with CD-HIT, secondary-structure compatibility, aggregation screening, homology search, and foldability evaluation), we identified a set of 5 novel proteins as potentially crystallizable.
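A minimal sketch of the average-embedding classification pipeline the abstract describes: per-residue PLM embeddings are mean-pooled into one fixed-length vector per protein, and a gradient-boosted classifier is trained on those vectors to score crystallization propensity. Here random arrays stand in for real per-residue ESM2 embeddings (which would come from TRILL), and scikit-learn's GradientBoostingClassifier stands in for LightGBM/XGBoost; `mean_pool` and `EMB_DIM` are illustrative names, not part of the paper's code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
EMB_DIM = 64  # illustrative; real ESM2 embeddings are 640- to 2560-dimensional

def mean_pool(per_residue: np.ndarray) -> np.ndarray:
    """Average a (seq_len, emb_dim) per-residue embedding matrix into (emb_dim,)."""
    return per_residue.mean(axis=0)

# Fake dataset: 200 "proteins" of varying length with binary crystallization labels.
X = np.stack([mean_pool(rng.normal(size=(rng.integers(50, 300), EMB_DIM)))
              for _ in range(200)])
y = rng.integers(0, 2, size=200)

# Gradient-boosted classifier on the pooled embeddings (LightGBM stand-in).
clf = GradientBoostingClassifier(n_estimators=50).fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # crystallization propensity scores in [0, 1]
print(probs.shape)  # prints (200,)
```

In the benchmark itself, these propensity scores would be evaluated with AUPR, AUC, and F1 on held-out test sets rather than on the training data as in this toy example.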
Pages: 17