DeepPPF: A deep learning framework for predicting protein family

Cited by: 16
Authors
Yusuf, Shehu Mohammed [1]
Zhang, Fuhao [1]
Zeng, Min [1]
Li, Min [1]
Affiliations
[1] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multi-scale convolutional neural network; Protein functional family; Protein sequence; Deep learning; Multiple sequence alignment
DOI
10.1016/j.neucom.2020.11.062
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Machine learning pipelines for protein functional family prediction are urgently needed, especially since only 1% of raw protein sequences have been manually annotated. Although existing machine learning algorithms have achieved decent performance in modeling and predicting the functional families of protein sequences, they still have two drawbacks. First, the biological dependencies among nucleotides that these methods capture are not rich enough to describe motifs. Second, existing algorithms are not accurate enough to predict the functional families of newly discovered proteins. To address both limitations, we propose a novel deep learning framework for predicting protein family, DeepPPF, which employs the word2vec technique to capture distributional dependencies among nucleotides and discovers rich features from diverse motif lengths to characterize proteins. The novelty of DeepPPF lies in utilizing distributional dependencies among nucleotides. Experimental results on G protein-coupled receptor hierarchical datasets show that DeepPPF achieves state-of-the-art performance, with Matthews correlation coefficients (MCC) of 97.62%, 88.45%, and 83.09% for the family, subfamily, and sub-subfamily hierarchical levels, respectively. DeepPPF also outperformed existing methods in terms of prediction accuracy and MCC on the cluster of orthologous groups (COG) and phage of orthologous groups (POG) datasets. Furthermore, we analyzed the ability of the DeepPPF framework to discover rich motifs for the functional classes with the fewest protein sequences. The experimental results show that rich motif discovery is key to improving the modeling performance of protein families through deep learning techniques. Finally, we investigated the effect of transferring a low-level functional domain to a high-level functional domain, and the results show that target domain prediction can be improved with transfer learning. Therefore, our proposed deep learning framework can be useful in characterizing protein functional families. The code and datasets are available at https://github.com/CSUBioGroup/DeepPPF. (C) 2020 Published by Elsevier B.V.
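For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of the Matthews correlation coefficient for the binary case, computed from confusion-matrix counts. The function name `binary_mcc` and the binary setting are assumptions made for illustration; the abstract does not say how MCC was computed across multiple family classes, so this should not be read as the authors' evaluation code.

```python
import math


def binary_mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts.

    Hypothetical helper for illustration only; it is not taken from
    the DeepPPF repository.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # By common convention, MCC is set to 0 when any marginal sum is zero.
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Made-up counts, not results from the paper.
    print(binary_mcc(tp=90, tn=85, fp=15, fn=10))
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), so the percentages in the abstract correspond to this quantity multiplied by 100.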
Pages: 19-29
Number of pages: 11