Learning functional properties of proteins with language models

Cited by: 88
Authors
Unsal, Serbulent [1, 2]
Atas, Heval [1]
Albayrak, Muammer [2]
Turhan, Kemal [2]
Acar, Aybar C. [1]
Dogan, Tunca [1, 3, 4]
Affiliations
[1] Middle East Tech Univ, Grad Sch Informat, Canc Syst Biol Lab KanSiL, Ankara, Turkey
[2] Karadeniz Tech Univ, Dept Biostat & Med Informat, Trabzon, Turkey
[3] Hacettepe Univ, Dept Comp Engn, Ankara, Turkey
[4] Hacettepe Univ, Inst Informat, Ankara, Turkey
Keywords
MOLECULAR-DYNAMICS SIMULATIONS; INTERACTION PREDICTION; STATISTICAL-MECHANICS; COMPUTATIONAL DESIGN; SEMANTIC SIMILARITY; POTENTIALS; DATABASE; TERMS
DOI
10.1038/s42256-022-00457-9
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, deep learning methods have shown promising results in characterizing proteins and extracting complex sequence-structure-function relationships. This Analysis describes a benchmarking study that compares the performance and advantages of recent deep learning approaches across a range of protein prediction tasks.

Data-centric approaches have been used to develop predictive methods for elucidating uncharacterized properties of proteins; however, studies indicate that these methods should be further improved to effectively solve critical problems in biomedicine and biotechnology, which can be achieved by better representing the data at hand. Novel data representation approaches mostly take inspiration from language models that have yielded ground-breaking improvements in natural language processing. Lately, these approaches have been applied to the field of protein science and have displayed highly promising results in terms of extracting complex sequence-structure-function relationships. In this study we conducted a detailed investigation of protein representation learning, first categorizing and explaining each approach and then benchmarking their performance in predicting: (1) semantic similarities between proteins, (2) ontology-based protein functions, (3) drug target protein families and (4) protein-protein binding affinity changes following mutations. We evaluate and discuss the advantages and disadvantages of each method in light of the benchmark results, source datasets and algorithms used, in comparison with classical model-driven approaches. Finally, we discuss current challenges and suggest future directions. We believe that the conclusions of this study will help researchers to apply machine/deep learning-based representation techniques to protein data for various predictive tasks, and inspire the development of novel methods.
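
The first benchmark task listed in the abstract, predicting semantic similarities between proteins, can be illustrated with a minimal sketch. The abstract does not state how agreement is scored, so everything here is an assumption made for illustration: cosine similarity between embedding vectors, Spearman rank correlation as the agreement measure, the function names, and the three-protein toy data (hypothetical values, not results from the study).

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def similarity_agreement(embeddings, reference_similarity):
    """Correlate embedding-based similarities with reference similarities.

    embeddings: dict mapping protein ID -> vector (list of floats).
    reference_similarity: dict mapping a sorted (id_a, id_b) pair -> score,
        e.g. an ontology-based semantic similarity (assumed to be given).
    """
    pairs = list(combinations(sorted(embeddings), 2))
    predicted = [cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs]
    reference = [reference_similarity[pair] for pair in pairs]
    return spearman(predicted, reference)


# Hypothetical toy data: three proteins with 2-dimensional embeddings.
embeddings = {
    "P1": [1.0, 0.0],
    "P2": [0.9, 0.1],
    "P3": [0.0, 1.0],
}
reference_similarity = {
    ("P1", "P2"): 0.8,
    ("P1", "P3"): 0.1,
    ("P2", "P3"): 0.3,
}

score = similarity_agreement(embeddings, reference_similarity)
print(f"Rank agreement: {score:.2f}")
```

A rank-based measure is used in this sketch because embedding similarities and ontology-based similarities are on different scales, so only their ordering is directly comparable.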
Pages: 227-245
Page count: 19