pLoc_bal-mAnimal: predict subcellular localization of animal proteins by balancing training dataset and PseAAC

Times cited: 86
Authors
Cheng, Xiang [1, 2]
Lin, Wei-Zhong [1]
Xiao, Xuan [1, 2]
Chou, Kuo-Chen [2, 3]
Affiliations
[1] Jingdezhen Ceram Inst, Comp Sci, Jingdezhen 333000, Peoples R China
[2] Gordon Life Sci Inst, Computat Biol, Boston, MA 02478 USA
[3] Univ Elect Sci & Technol China, Ctr Informat Biol, Chengdu 610054, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
AMINO-ACID-COMPOSITION; INCORPORATING EVOLUTIONARY INFORMATION; LYSINE SUCCINYLATION SITES; SEQUENCE-BASED PREDICTOR; ENSEMBLE CLASSIFIER; GENERAL-FORM; RECOMBINATION SPOTS; MULTI-LOCALIZATION; MEMBRANE-PROTEINS; CHOUS PSEAAC;
DOI
10.1093/bioinformatics/bty628
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: A cell contains numerous protein molecules. One of the fundamental goals in cell biology is to determine their subcellular locations, which can provide useful clues about their functions. Knowledge of protein subcellular localization is also indispensable for prioritizing and selecting the right targets for drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desired to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called 'pLoc-mAnimal' was developed for identifying the subcellular localization of animal proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with the multi-label systems in which some proteins, called 'multiplex proteins', may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mAnimal was trained by an extremely skewed dataset in which some subset (subcellular location) was about 128 times the size of the other subsets. Accordingly, such an uneven training dataset will inevitably cause a biased consequence.
Results: To alleviate such biased consequence, we have developed a new and bias-reducing predictor called pLoc_bal-mAnimal by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mAnimal, the existing state-of-the-art predictor, in identifying the subcellular localization of animal proteins.
Pages: 398-406
Page count: 9