Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets

Cited by: 58
Authors
Dallago, Christian [1, 2]
Schuetze, Konstantin [1]
Heinzinger, Michael [1, 2]
Olenyi, Tobias [1]
Littmann, Maria [1, 2]
Lu, Amy X. [3]
Yang, Kevin K. [4]
Min, Seonwoo [5]
Yoon, Sungroh [5, 6]
Morton, James T. [7]
Rost, Burkhard [1, 8, 9, 10, 11]
Affiliations
[1] TUM Tech Univ Munich, Dept Informat Bioinformat & Computat Biol, Garching, Germany
[2] TUM Grad Sch, CeDoSIA, Garching, Germany
[3] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada
[4] Microsoft Res New England, Cambridge, MA USA
[5] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul, South Korea
[6] Seoul Natl Univ, Interdisciplinary Program Bioinformat, Seoul, South Korea
[7] Flatiron Inst, Ctr Computat Biol, New York, NY USA
[8] Inst Adv Study TUM IAS, Garching, Germany
[9] TUM Sch Life Sci Weihenstephan WZW, Freising Weihenstephan, Germany
[10] Columbia Univ, Dept Biochem & Mol Biophys, New York, NY USA
[11] NYCOMPS, New York, NY USA
Source
CURRENT PROTOCOLS | 2021 / Volume 1 / Issue 05
Keywords
deep learning embeddings; machine learning; protein annotation pipeline; protein representations; protein visualization; secondary structure
DOI
10.1002/cpz1.113
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time with respect to previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be leveraged as input features through machine learning libraries to develop methods predicting particular aspects of protein function and structure. Beyond the workflows included here, embeddings have been leveraged as proxies to traditional homology-based inference and even to align similar protein sequences. A wealth of possibilities remain for researchers to harness through the tools provided in the following protocols. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.
Pages: 26