ProteinBERT: a universal deep-learning model of protein sequence and function

Cited by: 375
Authors
Brandes, Nadav [1]
Ofer, Dan [2]
Peleg, Yam [3]
Rappoport, Nadav [4]
Linial, Michal [2]
Affiliations
[1] Hebrew Univ Jerusalem, Sch Comp Sci & Engn, IL-9190401 Jerusalem, Israel
[2] Hebrew Univ Jerusalem, Alexander Silberman Inst Life Sci, Dept Biol Chem, IL-9190401 Jerusalem, Israel
[3] Deep Trading Ltd, IL-3508401 Haifa, Israel
[4] Ben Gurion Univ Negev, Fac Engn Sci, Dept Software & Informat Syst Engn, IL-8410501 Beer Sheva, Israel
Funding
Israel Science Foundation
Keywords
DOI
10.1093/bioinformatics/btac020
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.
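The dual-input idea in the abstract (per-residue "local" data alongside protein-level "global" GO annotations) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the vocabulary, the function names, and the corruption steps (random token replacement, random hiding of annotations) are hypothetical and not taken from the ProteinBERT code base, and the abstract itself does not specify how inputs are corrupted during pretraining.

```python
import random

# Hypothetical vocabulary: 20 standard amino acids plus a few special tokens.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<other>"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def encode_sequence(seq, length):
    """Local input: one token id per residue, truncated or padded to `length`."""
    tokens = ["<start>"]
    tokens += [aa if aa in AMINO_ACIDS else "<other>" for aa in seq]
    tokens.append("<end>")
    tokens = tokens[:length]  # long sequences are simply cut off in this sketch
    tokens += ["<pad>"] * (length - len(tokens))
    return [TOKEN_TO_ID[t] for t in tokens]


def encode_annotations(go_terms, go_vocab):
    """Global input: a multi-hot vector over a fixed list of GO terms."""
    present = set(go_terms)
    return [1 if term in present else 0 for term in go_vocab]


def corrupt_tokens(token_ids, rate, rng):
    """Replace a random fraction of amino-acid tokens with random amino acids."""
    first_aa = len(SPECIAL_TOKENS)
    corrupted = list(token_ids)
    for i, tok in enumerate(corrupted):
        if tok >= first_aa and rng.random() < rate:
            corrupted[i] = rng.randrange(first_aa, len(TOKEN_TO_ID))
    return corrupted


def corrupt_annotations(vector, drop_rate, rng):
    """Hide each positive annotation with probability `drop_rate`."""
    return [v if (v == 0 or rng.random() >= drop_rate) else 0 for v in vector]


if __name__ == "__main__":
    rng = random.Random(0)
    go_vocab = ["GO:0003677", "GO:0005524", "GO:0016020"]  # illustrative identifiers
    local_target = encode_sequence("MKVLAAGIW", length=16)
    global_target = encode_annotations(["GO:0005524"], go_vocab)
    local_input = corrupt_tokens(local_target, rate=0.1, rng=rng)
    global_input = corrupt_annotations(global_target, drop_rate=0.5, rng=rng)
    # A model trained on such pairs would map the corrupted inputs back to the targets.
    print(local_input, global_input)
```

In this sketch the uncorrupted encodings serve as training targets, so a single model would learn sequence recovery (language modeling) and annotation recovery (GO prediction) together; the corruption rates shown are arbitrary placeholders.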
Pages: 2102-2110
Number of pages: 9