In-Pero: Exploiting Deep Learning Embeddings of Protein Sequences to Predict the Localisation of Peroxisomal Proteins

Cited by: 18
Authors
Anteghini, Marco [1, 2]
dos Santos, Vitor Martins [1, 2]
Saccenti, Edoardo [1 ]
Affiliations
[1] Wageningen Univ & Res, Lab Syst & Synthet Biol, Stippeneng 4, NL-6708 WE Wageningen, Netherlands
[2] LifeGlimmer GmbH, D-12163 Berlin, Germany
Funding
EU Horizon 2020
Keywords
protein sequence encoding and embedding; machine learning; neural networks; subcellular localisation; sub-peroxisomal localisation; sub-mitochondrial localisation; SUBCELLULAR-LOCALIZATION; MITOCHONDRIAL; TOPOLOGY; MODEL;
DOI
10.3390/ijms22126409
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Peroxisomes are ubiquitous membrane-bound organelles, and aberrant localisation of peroxisomal proteins contributes to the pathogenesis of several disorders. Many computational methods assign protein sequences to subcellular compartments, but no tool is tailored to the sub-localisation (matrix vs. membrane) of peroxisomal proteins. We present In-Pero, a new method for predicting sub-peroxisomal protein localisation. In-Pero combines standard machine learning approaches with recently proposed multi-dimensional deep-learning representations of the protein amino-acid sequence. The method is trained and tested using a double cross-validation approach on a curated data set comprising 160 peroxisomal proteins with experimental evidence for sub-peroxisomal localisation, and it achieves a classification accuracy above 0.9 in predicting peroxisomal matrix and membrane proteins. We further show that the approach can be easily adapted (as In-Mito) to the prediction of mitochondrial protein localisation, where it outperforms existing tools for certain classes of proteins (matrix and inner-membrane).
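The double cross-validation mentioned in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the authors' pipeline: the "embeddings" are synthetic Gaussian vectors, the 120/40 class split and 16 dimensions are arbitrary assumptions, and a k-nearest-neighbour classifier stands in for whichever classifiers the paper actually uses. All function names (`stratified_folds`, `double_cross_validation`, `make_toy_embeddings`) are hypothetical. The point is only the structure: an inner loop selects a hyperparameter on development data, and an outer loop scores that choice on data the inner loop never saw.

```python
import random
from collections import Counter


def stratified_folds(labels, n_folds, rng):
    """Split sample positions into n_folds folds, preserving class proportions."""
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def knn_predict(train_X, train_y, x, k):
    """Return the majority label among the k nearest training vectors."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def accuracy(X, y, train_idx, test_idx, k):
    """Fraction of test samples classified correctly by k-NN."""
    train_X = [X[i] for i in train_idx]
    train_y = [y[i] for i in train_idx]
    hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx)
    return hits / len(test_idx)


def double_cross_validation(X, y, k_grid=(1, 3, 5), n_outer=5, n_inner=4, seed=0):
    """Outer loop estimates performance; inner loop picks k on development data only."""
    rng = random.Random(seed)
    outer_scores = []
    for held_out in stratified_folds(y, n_outer, rng):
        held_out_set = set(held_out)
        dev_idx = [i for i in range(len(y)) if i not in held_out_set]
        # Inner folds index into dev_idx, so the held-out fold is never touched.
        inner_folds = stratified_folds([y[i] for i in dev_idx], n_inner, rng)

        def inner_score(k):
            scores = []
            for fold in inner_folds:
                val = [dev_idx[j] for j in fold]
                val_set = set(val)
                train = [i for i in dev_idx if i not in val_set]
                scores.append(accuracy(X, y, train, val, k))
            return sum(scores) / len(scores)

        best_k = max(k_grid, key=inner_score)
        outer_scores.append(accuracy(X, y, dev_idx, held_out, best_k))
    return outer_scores


def make_toy_embeddings(n_per_class=(120, 40), dim=16, shift=1.5, seed=1):
    """Synthetic stand-in for sequence embeddings (assumed sizes, not real data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, n in enumerate(n_per_class):
        for _ in range(n):
            X.append([rng.gauss(label * shift, 1.0) for _ in range(dim)])
            y.append(label)
    return X, y


X, y = make_toy_embeddings()
scores = double_cross_validation(X, y)
print("outer-fold accuracies:", [round(s, 3) for s in scores])
print("mean accuracy:", round(sum(scores) / len(scores), 3))
```

Because the hyperparameter is chosen without access to the outer test fold, the outer-fold accuracies are not inflated by the tuning step, which matters for a small data set such as the 160 proteins described above.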
Pages: 16