Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)

Cited by: 51
Authors
Asgari, Ehsaneddin [1, 2, 3]
McHardy, Alice C. [3]
Mofrad, Mohammad R. K. [1, 2, 4]
Affiliations
[1] Univ Calif Berkeley, Dept Bioengn, Mol Cell Biomech Lab, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Dept Mech Engn, Mol Cell Biomech Lab, Berkeley, CA 94720 USA
[3] Helmholtz Ctr Infect Res, Computat Biol Infect Res, D-38124 Braunschweig, Germany
[4] Lawrence Berkeley Natl Lab, Mol Biophys & Integrated Bioimaging, Berkeley, CA 94720 USA
Keywords
BINDING; LANGUAGE; RGD; SPECIFICITIES; PREDICTION; UNIPROT;
DOI
10.1038/s41598-019-38746-w
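Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];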
Discipline classification code
07; 0710; 09;
Abstract
In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or a domain-specific dataset and then applied to a set of unseen sequences. This representation can be used as the input to any downstream machine learning task in protein bioinformatics. Here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate it for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems, (2) a classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall scores while having an F1 score comparable to other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short-lists for further experimental investigation of motifs. In the classification-based evaluation, the extracted motifs could reliably detect integrins, integrin-binding proteins, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend the k-mer based protein vector (ProtVec) embedding to a variable-length protein embedding using PPE sub-sequences. We show that the new embedding method can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acid k-mer features.
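The BPE-style segmentation that the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`learn_merges`, `apply_merge`, `segment`), the toy sequences, and the lexicographic tie-breaking rule are all assumptions made for illustration. The sampling framework is only approximated here by the hypothetical `num_steps` argument, which applies a prefix of the learned merges so that the same sequence can be segmented at different granularities.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn up to `num_merges` BPE-style merge operations from amino-acid strings."""
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties are broken lexicographically
        # (an assumption, chosen only to make the sketch deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def segment(sequence, merges, num_steps=None):
    """Segment an unseen sequence with the first `num_steps` merges (all if None)."""
    tokens = list(sequence)
    for pair in merges[:num_steps]:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy usage on made-up sequences containing the RGD motif.
merges = learn_merges(["RGDRGD", "ARGDK"], num_merges=2)
print(segment("KRGDA", merges))               # all learned merges
print(segment("KRGDA", merges, num_steps=1))  # a coarser prefix of the merges
```

Varying `num_steps` (for example, by drawing it from a distribution) yields several segmentations of one sequence, which is one simple way to picture the "multiple ways of segmenting a sequence" mentioned above.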
Pages: 16
Related papers
72 in total
[1]   Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning [J].
Alipanahi, Babak ;
Delong, Andrew ;
Weirauch, Matthew T. ;
Frey, Brendan J. .
NATURE BIOTECHNOLOGY, 2015, 33 (08) :831-+
[2]  
[Anonymous], 1993, ARTIF INTELL
[3]  
Apweiler R, 2004, NUCLEIC ACIDS RES, V32, pD115, DOI [10.1093/nar/gkw1099, 10.1093/nar/gkh131]
[4]   DiTaxa: nucleotide-pair encoding of 16S rRNA for host phenotype and biomarker detection [J].
Asgari, Ehsaneddin ;
Muench, Philipp C. ;
Lesker, Till R. ;
McHardy, Alice C. ;
Mofrad, Mohammad R. K. .
BIOINFORMATICS, 2019, 35 (14) :2498-2500
[5]   MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples [J].
Asgari, Ehsaneddin ;
Garakani, Kiavash ;
McHardy, Alice C. ;
Mofrad, Mohammad R. K. .
BIOINFORMATICS, 2018, 34 (13) :32-42
[6]   Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics [J].
Asgari, Ehsaneddin ;
Mofrad, Mohammad R. K. .
PLOS ONE, 2015, 10 (11)
[7]  
Asgari Ehsaneddin., 2016, P WORKSHOP MULTILING, P65
[8]   Prediction of nucleosome positioning by the incorporation of frequencies and distributions three different nucleotide segment lengths into a general pseudo k-tuple nucleotide composition [J].
Awazu, Akinori .
BIOINFORMATICS, 2017, 33 (01) :42-48
[9]   MEME SUITE: tools for motif discovery and searching [J].
Bailey, Timothy L. ;
Boden, Mikael ;
Buske, Fabian A. ;
Frith, Martin ;
Grant, Charles E. ;
Clementi, Luca ;
Ren, Jingyuan ;
Li, Wilfred W. ;
Noble, William S. .
NUCLEIC ACIDS RESEARCH, 2009, 37 :W202-W208
[10]   NLSdb-major update for database of nuclear localization signals and nuclear export signals [J].
Bernhofer, Michael ;
Goldberg, Tatyana ;
Wolf, Silvana ;
Ahmed, Mohamed ;
Zaugg, Julian ;
Boden, Mikael ;
Rost, Burkhard .
NUCLEIC ACIDS RESEARCH, 2018, 46 (D1) :D503-D508