PredictEFC: a fast and efficient multi-label classifier for predicting enzyme family classes

被引:14
作者
Chen, Lei [1 ]
Zhang, Chenyu [1 ]
Xu, Jing [1 ]
机构
[1] Shanghai Maritime Univ, Coll Informat Engn, Shanghai 201306, Peoples R China
关键词
Enzymes; Family class; Multi-label classification; Functional domain; Dimension reduction; Random forest; AMINO-ACID-COMPOSITION; FUNCTIONAL DOMAIN COMPOSITION; SUPPORT VECTOR MACHINE; PROTEIN-STRUCTURE; DATABASE;
D O I
10.1186/s12859-024-05665-1
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
BackgroundEnzymes play an irreplaceable and important role in maintaining the lives of living organisms. The Enzyme Commission (EC) number of an enzyme indicates its essential functions. Correct identification of the first digit (family class) of the EC number for a given enzyme is a hot topic in the past twenty years. Several previous methods adopted functional domain composition to represent enzymes. However, it would lead to dimension disaster, thereby reducing the efficiency of the methods. On the other hand, most previous methods can only deal with enzymes belonging to one family class. In fact, several enzymes belong to two or more family classes.ResultsIn this study, a fast and efficient multi-label classifier, named PredictEFC, was designed. To construct this classifier, a novel feature extraction scheme was designed for processing functional domain information of enzymes, which counting the distribution of each functional domain entry across seven family classes in the training dataset. Based on this scheme, each training or test enzyme was encoded into a 7-dimenion vector by fusing its functional domain information and above statistical results. Random k-labelsets (RAKEL) was adopted to build the classifier, where random forest was selected as the base classification algorithm. The two tenfold cross-validation results on the training dataset shown that the accuracy of PredictEFC can reach 0.8493 and 0.8370. The independent test on two datasets indicated the accuracy values of 0.9118 and 0.8777.ConclusionThe performance of PredictEFC was slightly lower than the classifier directly using functional domain composition. However, its efficiency was sharply improved. The running time was less than one-tenth of the time of the classifier directly using functional domain composition. In additional, the utility of PredictEFC was superior to the classifiers using traditional dimensionality reduction methods and some previous methods, and this classifier can be transplanted for predicting enzyme family classes of other species. Finally, a web-server available at http://124.221.158.221/ was set up for easy usage.
引用
收藏
页数:27
相关论文
共 54 条
[1]   The InterPro database, an integrated documentation resource for protein families, domains and functional sites [J].
Apweiler, R ;
Attwood, TK ;
Bairoch, A ;
Bateman, A ;
Birney, E ;
Biswas, M ;
Bucher, P ;
Cerutti, T ;
Corpet, F ;
Croning, MDR ;
Durbin, R ;
Falquet, L ;
Fleischmann, W ;
Gouzy, J ;
Hermjakob, H ;
Hulo, N ;
Jonassen, I ;
Kahn, D ;
Kanapin, A ;
Karavidopoulou, Y ;
Lopez, R ;
Marx, B ;
Mulder, NJ ;
Oinn, TM ;
Pagni, M ;
Servant, F ;
Sigrist, CJA ;
Zdobnov, EM .
NUCLEIC ACIDS RESEARCH, 2001, 29 (01) :37-40
[2]   BENZ WS: the Bologna ENZyme Web Server for four-level EC number annotation [J].
Baldazzi, Davide ;
Savojardo, Castrense ;
Martelli, Pier Luigi ;
Casadio, Rita .
NUCLEIC ACIDS RESEARCH, 2021, 49 (W1) :W60-W66
[3]   The InterPro protein families and domains database: 20 years on [J].
Blum, Matthias ;
Chang, Hsin-Yu ;
Chuguransky, Sara ;
Grego, Tiago ;
Kandasaamy, Swaathi ;
Mitchell, Alex ;
Nuka, Gift ;
Paysan-Lafosse, Typhaine ;
Qureshi, Matloob ;
Raj, Shriya ;
Richardson, Lorna ;
Salazar, Gustavo A. ;
Williams, Lowri ;
Bork, Peer ;
Bridge, Alan ;
Gough, Julian ;
Haft, Daniel H. ;
Letunic, Ivica ;
Marchler-Bauer, Aron ;
Mi, Huaiyu ;
Natale, Darren A. ;
Necci, Marco ;
Orengo, Christine A. ;
Pandurangan, Arun P. ;
Rivoire, Catherine ;
Sigrist, Christian J. A. ;
Sillitoe, Ian ;
Thanki, Narmada ;
Thomas, Paul D. ;
Tosatto, Silvio C. E. ;
Wu, Cathy H. ;
Bateman, Alex ;
Finn, Robert D. .
NUCLEIC ACIDS RESEARCH, 2021, 49 (D1) :D344-D354
[4]  
Borro LC, 2006, GENET MOL RES, V5, P193
[5]   Random forests [J].
Breiman, L .
MACHINE LEARNING, 2001, 45 (01) :5-32
[6]   Enzyme family classification by support vector machines [J].
Cai, CZ ;
Han, LY ;
Ji, ZL ;
Chen, YZ .
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2004, 55 (01) :66-76
[7]   Predicting enzyme subclass by functional domain composition and pseudo amino acid composition [J].
Cai, YD ;
Chou, KC .
JOURNAL OF PROTEOME RESEARCH, 2005, 4 (03) :967-971
[8]   Predicting enzyme family classes by hybridizing gene product composition and pseudo-amino acid composition [J].
Cai, YD ;
Zhou, GP ;
Chou, KC .
JOURNAL OF THEORETICAL BIOLOGY, 2005, 234 (01) :145-149
[9]   Using functional domain composition to predict enzyme family classes [J].
Cai, YD ;
Chou, KC .
JOURNAL OF PROTEOME RESEARCH, 2005, 4 (01) :109-111
[10]   Identification of Multi-Functional Enzyme with Multi-Label Classifier [J].
Che, Yuxin ;
Ju, Ying ;
Xuan, Ping ;
Long, Ren ;
Xing, Fei .
PLOS ONE, 2016, 11 (04)