ProteinBERT: a universal deep-learning model of protein sequence and function

Cited by: 286
Authors
Brandes, Nadav [1]
Ofer, Dan [2]
Peleg, Yam [3]
Rappoport, Nadav [4]
Linial, Michal [2]
Affiliations
[1] Hebrew Univ Jerusalem, Sch Comp Sci & Engn, IL-9190401 Jerusalem, Israel
[2] Hebrew Univ Jerusalem, Alexander Silberman Inst Life Sci, Dept Biol Chem, IL-9190401 Jerusalem, Israel
[3] Deep Trading Ltd, IL-3508401 Haifa, Israel
[4] Ben Gurion Univ Negev, Fac Engn Sci, Dept Software & Informat Syst Engn, IL-8410501 Beer Sheva, Israel
Funding
Israel Science Foundation
DOI
10.1093/bioinformatics/btac020
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.
Pages: 2102-2110
Page count: 9