Exploring Machine Learning Algorithms and Protein Language Models Strategies to Develop Enzyme Classification Systems

Cited by: 3
Authors
Fernandez, Diego [1]
Olivera-Nappa, Alvaro [2, 3]
Uribe-Paredes, Roberto [1]
Medina-Ortiz, David [1, 2]
Affiliations
[1] Univ Magallanes, Dept Ingn Computac, Ave Pdte Manuel Bulnes, Punta Arenas 01855, Chile
[2] Univ Chile, Dept Ingn Quim Biotecnol & Mat, Beauchef 851, Santiago, Chile
[3] Univ Chile, Ctr Biotechnol & Bioengn, Beauchef 851, Santiago, Chile
Source
BIOINFORMATICS AND BIOMEDICAL ENGINEERING, IWBBIO 2023, PT I | 2023 / Vol. 13919
Keywords
Machine learning algorithms; protein language models; EC number classifications; convolutional neural networks; enzyme discovery; PREDICTION;
DOI
10.1007/978-3-031-34953-9_24
Chinese Library Classification number
R318 [Biomedical Engineering];
Discipline classification code
0831;
Abstract
Discovering functionalities for unknown enzymes has been one of the most common bioinformatics tasks. Functional annotation methods based on phylogenetic properties have been the gold standard in every genome annotation process. However, these methods only succeed if the minimum requirements for expressing similarity or homology are met. Alternatively, machine learning and deep learning methods have proven helpful for this problem, yielding functional classification systems for various bioinformatics tasks. Nevertheless, a clear strategy is still needed for building predictive models and for representing amino acid sequences numerically. In this work, we address the functional classification of enzyme sequences (EC number) via machine learning methods, exploring various alternatives for training predictive models and for numerical representation. The results show that the best performances are achieved with representations based on pre-trained models. However, a clear strategy for training the models is still needed. Exploring several alternatives, we observe that the CNN-based architectures proposed in this work show a greater capacity for learning and pattern extraction in complex systems, achieving performances above 97% with binary cross-entropy error rates below 0.05. Finally, we discuss the strategies explored and outline future work on integrated methods for functional classification and the discovery of new enzymes to support current bioinformatics tools.
Pages: 307-319
Number of pages: 13
Related papers
30 in total
[1]   EFICAz2: enzyme function inference by a combined approach enhanced by machine learning [J].
Arakaki, Adrian K. ;
Huang, Ying ;
Skolnick, Jeffrey .
BMC BIOINFORMATICS, 2009, 10
[2]   Industrial applications of immobilized enzymes-A review [J].
Basso, Alessandra ;
Serban, Simona .
MOLECULAR CATALYSIS, 2019, 479 :35-54
[3]   UniProt: a worldwide hub of protein knowledge [J].
Bateman, Alex ;
Martin, Maria-Jesus ;
Orchard, Sandra ;
Magrane, Michele ;
Alpi, Emanuele ;
Bely, Benoit ;
Bingley, Mark ;
Britto, Ramona ;
Bursteinas, Borisas ;
Busiello, Gianluca ;
Bye-A-Jee, Hema ;
Da Silva, Alan ;
De Giorgi, Maurizio ;
Dogan, Tunca ;
Castro, Leyla Garcia ;
Garmiri, Penelope ;
Georghiou, George ;
Gonzales, Daniel ;
Gonzales, Leonardo ;
Hatton-Ellis, Emma ;
Ignatchenko, Alexandr ;
Ishtiaq, Rizwan ;
Jokinen, Petteri ;
Joshi, Vishal ;
Jyothi, Dushyanth ;
Lopez, Rodrigo ;
Luo, Jie ;
Lussi, Yvonne ;
MacDougall, Alistair ;
Madeira, Fabio ;
Mahmoudy, Mahdi ;
Menchi, Manuela ;
Nightingale, Andrew ;
Onwubiko, Joseph ;
Palka, Barbara ;
Pichler, Klemens ;
Pundir, Sangya ;
Qi, Guoying ;
Raj, Shriya ;
Renaux, Alexandre ;
Lopez, Milagros Rodriguez ;
Saidi, Rabie ;
Sawford, Tony ;
Shypitsyna, Aleksandra ;
Speretta, Elena ;
Turner, Edward ;
Tyagi, Nidhi ;
Vasudev, Preethi ;
Volynkin, Vladimir ;
Wardell, Tony .
NUCLEIC ACIDS RESEARCH, 2019, 47 (D1) :D506-D515
[4]   Machine learning techniques for protein function prediction [J].
Bonetta, Rosalin ;
Valentino, Gianluca .
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2020, 88 (03) :397-413
[5]  
Burley SK, 2017, METHODS MOL BIOL, V1606, P627, DOI 10.1007/978-1-4939-7000-1_26
[6]   A machine learning approach for reliable prediction of amino acid interactions and its application in the directed evolution of enantioselective enzymes [J].
Cadet, Frederic ;
Fontaine, Nicolas ;
Li, Guangyue ;
Sanchis, Joaquin ;
Chong, Matthieu Ng Fuk ;
Pandjaitan, Rudy ;
Vetrivel, Iyanar ;
Offmann, Bernard ;
Reetz, Manfred T. .
SCIENTIFIC REPORTS, 2018, 8
[7]   Biopython: freely available Python tools for computational molecular biology and bioinformatics [J].
Cock, Peter J. A. ;
Antao, Tiago ;
Chang, Jeffrey T. ;
Chapman, Brad A. ;
Cox, Cymon J. ;
Dalke, Andrew ;
Friedberg, Iddo ;
Hamelryck, Thomas ;
Kauff, Frank ;
Wilczynski, Bartek ;
de Hoon, Michiel J. L. .
BIOINFORMATICS, 2009, 25 (11) :1422-1423
[8]  
Copeland R.A., 2023, Enzymes: A Practical Introduction to Structure, Mechanism, and Data Analysis, 3rd ed.
[9]   Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets [J].
Dallago, Christian ;
Schuetze, Konstantin ;
Heinzinger, Michael ;
Olenyi, Tobias ;
Littmann, Maria ;
Lu, Amy X. ;
Yang, Kevin K. ;
Min, Seonwoo ;
Yoon, Sungroh ;
Morton, James T. ;
Rost, Burkhard .
CURRENT PROTOCOLS, 2021, 1 (05)
[10]   Deep Learning in Protein Structural Modeling and Design [J].
Gao, Wenhao ;
Mahajan, Sai Pooja ;
Sulam, Jeremias ;
Gray, Jeffrey J. .
PATTERNS, 2020, 1 (09)