EnZymClass: Substrate specificity prediction tool of plant acyl-ACP thioesterases based on ensemble learning

Cited by: 10
Authors
Banerjee, Deepro [1]
Jindra, Michael A. [2]
Linot, Alec J. [2]
Pfleger, Brian F. [2]
Maranas, Costas D. [3]
Affiliations
[1] Penn State Univ, Huck Inst Life Sci, Bioinformat & Genom Program, University Pk, PA 16802 USA
[2] Univ Wisconsin, Dept Chem & Biol Engn, Madison, WI USA
[3] Penn State Univ, Dept Chem Engn, University Pk, PA 16802 USA
Funding
US National Science Foundation
Keywords
Thioesterase; Enzyme classification; Machine learning; Substrate specificity; Medium-chain oleochemicals; Synthetic biology
DOI
10.1016/j.crbiot.2021.12.002
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Subject classification codes
071005; 0836; 090102; 100705
Abstract
Experimentally characterizing the functional properties of plant acyl-ACP thioesterases (TEs), a key enzyme class used in the production of renewable oleochemicals in microbial hosts, can be expensive and time-consuming because it requires manual screening of thousands of candidates in a database. Using amino acid sequence to computationally predict an enzyme's function might accelerate this process; however, obtaining the amount of information on previously characterized enzymes and their sequences that standard machine learning (ML) approaches need to accurately infer sequence-function relationships can be prohibitive, especially with a low-throughput testing cycle. Experimental noise and unbalanced datasets in which high sequence similarity does not always imply identical functional properties further hinder robust prediction performance. Herein we present an ML method, Ensemble method for enZyme Classification (EnZymClass), that is specifically designed to address these issues. We used EnZymClass to classify TEs into short, long, and mixed free fatty acid substrate specificity categories. While general guidelines for inferring substrate specificity have been proposed before, prediction of chain-length preference from primary sequence has remained elusive for plant acyl-ACP TEs. By applying EnZymClass to a subset of TEs in the ThYme database, we identified two medium-chain TEs, ClFatB3 and CwFatB2, with previously uncharacterized activity in E. coli fatty acid production hosts. EnZymClass can be readily applied to other protein classification challenges and is available at: https://github.com/deeprob/ThioesteraseEnzymeSpecificity.
Pages: 1-9
Page count: 9