Learning functional properties of proteins with language models

Cited by: 88

Authors:

Unsal, Serbulent [1,2]

Atas, Heval [1]

Albayrak, Muammer [2]

Turhan, Kemal [2]

Acar, Aybar C. [1]

Dogan, Tunca [1,3,4]

Affiliations:

[1] Middle East Tech Univ, Grad Sch Informat, Canc Syst Biol Lab KanSiL, Ankara, Turkey

[2] Karadeniz Tech Univ, Dept Biostat & Med Informat, Trabzon, Turkey

[3] Hacettepe Univ, Dept Comp Engn, Ankara, Turkey

[4] Hacettepe Univ, Inst Informat, Ankara, Turkey

Source:

NATURE MACHINE INTELLIGENCE | 2022 / Volume 4 / Issue 03

Keywords:

MOLECULAR-DYNAMICS SIMULATIONS; INTERACTION PREDICTION; STATISTICAL-MECHANICS; COMPUTATIONAL DESIGN; SEMANTIC SIMILARITY; POTENTIALS; DATABASE; TERMS;

DOI:

10.1038/s42256-022-00457-9

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Deep learning methods have in recent years shown promising results in characterizing proteins and extracting complex sequence-structure-function relationships. This Analysis describes a benchmarking study to compare the performances and advantages of recent deep learning approaches in a range of protein prediction tasks. Data-centric approaches have been used to develop predictive methods for elucidating uncharacterized properties of proteins; however, studies indicate that these methods should be further improved to effectively solve critical problems in biomedicine and biotechnology, which can be achieved by better representing the data at hand. Novel data representation approaches mostly take inspiration from language models that have yielded ground-breaking improvements in natural language processing. Lately, these approaches have been applied to the field of protein science and have displayed highly promising results in terms of extracting complex sequence-structure-function relationships. In this study we conducted a detailed investigation over protein representation learning by first categorizing/explaining each approach, subsequently benchmarking their performances on predicting: (1) semantic similarities between proteins, (2) ontology-based protein functions, (3) drug target protein families and (4) protein-protein binding affinity changes following mutations. We evaluate and discuss the advantages and disadvantages of each method over the benchmark results, source datasets and algorithms used, in comparison with classical model-driven approaches. Finally, we discuss current challenges and suggest future directions. We believe that the conclusions of this study will help researchers to apply machine/deep learning-based representation techniques to protein data for various predictive tasks, and inspire the development of novel methods.

Pages: 227-245

Page count: 19

