An improved deep learning model for hierarchical classification of protein families

Cited by: 12
Authors
Sandaruwan, Pahalage Dhanushka [1]
Wannige, Champi Thusangi [1]
Affiliations
[1] Univ Ruhuna, Dept Comp Sci, Matara, Sri Lanka
Source
PLOS ONE | 2021 / Vol. 16 / Issue 10
Keywords
ROC CURVE; PREDICTION; DATABASE
DOI
10.1371/journal.pone.0258625
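Chinese Library Classification (CLC) codes
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Subject classification codes
07; 0710; 09
Abstract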
Although genes carry information, proteins are the main players in providing all the functionality of a living organism. Large numbers of different proteins are involved in every function that occurs in a cell. These amino acid sequences can be hierarchically classified into a set of families and subfamilies depending on their evolutionary relatedness and similarities in their structure or function. Protein characterization to identify protein structure and function is done accurately using laboratory experiments. With the rapidly increasing number of novel protein sequences, these experiments have become difficult to carry out since they are expensive, time-consuming, and laborious. Therefore, many computational classification methods have been introduced to classify proteins and predict their functional properties. As the performance of computational techniques has improved, deep learning has come to play a key role in many areas. Novel deep learning models such as DeepFam and ProtCNN have recently been presented to classify proteins into their families. However, these deep learning models have been used to carry out non-hierarchical classification of proteins. In this research, we propose a deep learning neural network model named DeepHiFam that classifies proteins hierarchically into different levels simultaneously with high accuracy. The model achieved an accuracy of 98.38% for protein family classification and more than 80% accuracy for the classification of protein subfamilies and sub-subfamilies. Further, DeepHiFam performed well in the non-hierarchical classification of protein families, achieving accuracies of 98.62% and 96.14% on the popular Pfam and COG datasets, respectively.
Pages: 15