Transformer-based deep learning for predicting protein properties in the life sciences

Cited by: 75
Authors
Chandra, Abel [1]
Tunnermann, Laura [2]
Lofstedt, Tommy [1]
Gratz, Regina [2, 3]
Affiliations
[1] Umea Univ, Dept Comp Sci, Umea, Sweden
[2] Swedish Univ Agr Sci, Umea Plant Sci Ctr UPSC, Dept Forest Genet & Plant Physiol, Umea, Sweden
[3] Swedish Univ Agr Sci, Dept Forest Ecol & Management, Umea, Sweden
Keywords
deep learning; transformers; life sciences; protein property prediction; machine learning; SECONDARY STRUCTURE PREDICTION; DATABASE; LANGUAGE; SITES
DOI
10.7554/eLife.82819
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between the number of sequenced proteins and proteins with known properties based on lab experiments. Language models from the field of natural language processing have gained popularity for protein property predictions and have led to a new computational revolution in biology, where old prediction results are being improved regularly. Such models can learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field of natural language processing is growing quickly because of developments in a class of models based on a particular model: the Transformer model. We review recent developments and the use of large-scale Transformer models in applications for predicting protein characteristics and how such models can be used to predict, for example, post-translational modifications. We review shortcomings of other deep learning models and explain how the Transformer models have quickly proven to be a very promising way to unravel information hidden in the sequences of amino acids.
Pages: 25