Advances in the Prediction of Protein Subcellular Locations with Machine Learning

Cited by: 14
Authors
Zhang, Ting-He [1, 2]
Zhang, Shao-Wu [2]
Affiliations
[1] Univ Texas San Antonio, Dept Elect & Comp Engn, San Antonio, TX 78230 USA
[2] Northwestern Polytech Univ, Sch Automat, Xian 710072, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Protein subcellular location; prediction; dataset construction; feature representation; machine learning; protein sequences; AMINO-ACID-COMPOSITION; MULTIPLE CLASSIFIER FUSION; TOP-DOWN APPROACH; LOCALIZATION PREDICTION; ROTATION FOREST; GENE ONTOLOGY; QUATERNARY STRUCTURE; LABEL CLASSIFIER; MPLOC; SINGLE;
DOI
10.2174/1574893614666181217145156
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Revealing the subcellular location of a newly discovered protein can provide insight into its function and guide research at the cellular level. The experimental methods currently used to identify protein subcellular locations are both time-consuming and expensive. It is therefore highly desirable to develop computational methods that identify protein subcellular locations efficiently and effectively. In particular, the rapidly increasing number of protein sequences entering the genome databases calls for automated analysis methods. Methods: In this review, we describe recent advances in predicting protein subcellular locations with machine learning from the following aspects: i) protein subcellular location benchmark dataset construction, ii) protein feature representation and feature descriptors, iii) common machine learning algorithms, iv) cross-validation test methods and assessment metrics, and v) web servers. Results & Conclusion: Given the large number of protein sequences generated by high-throughput technologies, four future directions for predicting protein subcellular locations with machine learning deserve attention. The first is the selection of novel and effective features (e.g., statistical, physicochemical, evolutionary) from the sequences and structures of proteins. The second is the feature fusion strategy. The third is the design of a powerful predictor, and the fourth is the prediction of proteins located at multiple subcellular sites.
Pages: 406-421
Number of pages: 16