pLoc_bal-mAnimal: predict subcellular localization of animal proteins by balancing training dataset and PseAAC

Citations: 86
Authors
Cheng, Xiang [1, 2]
Lin, Wei-Zhong [1]
Xiao, Xuan [1, 2]
Chou, Kuo-Chen [2, 3]
Affiliations
[1] Jingdezhen Ceram Inst, Comp Sci, Jingdezhen 333000, Peoples R China
[2] Gordon Life Sci Inst, Computat Biol, Boston, MA 02478 USA
[3] Univ Elect Sci & Technol China, Ctr Informat Biol, Chengdu 610054, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
AMINO-ACID-COMPOSITION; INCORPORATING EVOLUTIONARY INFORMATION; LYSINE SUCCINYLATION SITES; SEQUENCE-BASED PREDICTOR; ENSEMBLE CLASSIFIER; GENERAL-FORM; RECOMBINATION SPOTS; MULTI-LOCALIZATION; MEMBRANE-PROTEINS; CHOUS PSEAAC;
DOI
10.1093/bioinformatics/bty628
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: A cell contains numerous protein molecules. One of the fundamental goals in cell biology is to determine their subcellular locations, which can provide useful clues about their functions. Knowledge of protein subcellular localization is also indispensable for prioritizing and selecting the right targets for drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desired to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called 'pLoc-mAnimal' was developed for identifying the subcellular localization of animal proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with the multi-label systems in which some proteins, called 'multiplex proteins', may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mAnimal was trained by an extremely skewed dataset in which some subset (subcellular location) was about 128 times the size of the other subsets. Accordingly, such an uneven training dataset will inevitably cause a biased consequence.
Results: To alleviate such biased consequence, we have developed a new and bias-reducing predictor called pLoc_bal-mAnimal by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mAnimal, the existing state-of-the-art predictor, in identifying the subcellular localization of animal proteins.
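Illustrative note: the abstract does not describe how the quasi-balancing was carried out, so the following is only a minimal sketch of one generic way to reduce class skew, namely random oversampling of the smaller classes. It is not the authors' procedure. The function names `imbalance_ratio` and `oversample_to_ratio`, the `max_ratio` parameter, and the single-label setting are all assumptions made for illustration; the real task is multi-label.

```python
import math
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return (size of largest class) / (size of smallest class)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def oversample_to_ratio(samples, labels, max_ratio=1.0, seed=0):
    """Hypothetical helper: duplicate random samples of small classes until
    the largest class is at most `max_ratio` times the size of any other.

    Assumes one label per sample (the paper's setting is multi-label).
    """
    if max_ratio < 1.0:
        raise ValueError("max_ratio must be >= 1.0")
    rng = random.Random(seed)

    # Group samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Minimum size every class must reach to satisfy the ratio bound.
    largest = max(len(xs) for xs in by_class.values())
    target = math.ceil(largest / max_ratio)

    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Keep all original samples, then add random duplicates if needed.
        extra = [rng.choice(xs) for _ in range(max(0, target - len(xs)))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y
```

For a toy set with 128 samples in one class and 1 in another, `imbalance_ratio` reports the 128-fold skew mentioned in the abstract, and calling `oversample_to_ratio(..., max_ratio=2.0)` grows the small class until the skew is at most 2. Duplicating samples adds no new information, which is why synthetic approaches are often preferred in practice.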
Pages: 398-406
Page count: 9