PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis

Cited by: 38
Authors
Chang, Jia-Ming [1]
Su, Emily Chia-Yu [2, 3]
Lo, Allan [2, 4]
Chiu, Hua-Sheng [1]
Sung, Ting-Yi [1]
Hsu, Wen-Lian [1]
Affiliations
[1] Acad Sinica, Inst Informat Sci, Bioinformat Lab, Taipei 115, Taiwan
[2] Acad Sinica, Taiwan Int Grad Program, Bioinformat Program, Taipei 115, Taiwan
[3] Natl Chiao Tung Univ, Inst Bioinformat, Hsinchu, Taiwan
[4] Natl Tsing Hua Univ, Dept Life Sci, Hsinchu, Taiwan
Keywords
protein subcellular localization; document classification; vector space model; gapped-dipeptides; probabilistic latent semantic analysis; support vector machines
DOI
10.1002/prot.21944
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. Many computational approaches for PSL prediction based on protein sequences have been proposed in recent years for Gram-negative bacteria. We present PSLDoc, a method based on gapped-dipeptides and probabilistic latent semantic analysis (PLSA) to solve this problem. A protein is treated as a string of terms composed of gapped-dipeptides, which are defined as any two residues separated by one or more positions. Each gapped-dipeptide is weighted according to a position-specific scoring matrix, which incorporates sequence evolutionary information. PLSA is then applied for feature reduction, and the reduced vectors are input to five one-versus-rest support vector machine classifiers. The localization site with the highest probability is assigned as the final prediction. A strong correlation between sequence homology and subcellular localization has been reported (Nair and Rost, Protein Sci 2002;11:2836-2847; Yu et al., Proteins 2006;64:643-651). To properly evaluate the performance of PSLDoc, a target protein can be classified into low- or high-homology data sets. PSLDoc's overall accuracy on the low- and high-homology data sets reaches 86.84% and 98.21%, respectively, and it compares favorably with that of CELLO II (Yu et al., Proteins 2006;64:643-651). In addition, we set a confidence threshold to achieve high precision at specified levels of recall. When the confidence threshold is set at 0.7, PSLDoc achieves 97.89% precision, which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617-623). Our approach demonstrates that this specific feature representation for proteins can be successfully applied to the prediction of protein subcellular localization and improves prediction accuracy. Moreover, because of the generality of the representation, our method can be extended to eukaryotic proteomes in the future. The web server of PSLDoc is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/PSLDoc/.
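As an illustration only, the gapped-dipeptide term representation and the confidence-threshold rule described in the abstract might be sketched as follows. This is a minimal sketch, not the PSLDoc implementation: the function names, the term spelling `X<gap>Y`, the reading of "gap" as the number of intervening positions, the `max_gap` parameter, and the example site names are all assumptions made for the example. It produces raw term counts, whereas the abstract describes weights derived from a position-specific scoring matrix, and it omits the PLSA reduction and the SVM classifiers entirely.

```python
from collections import Counter


def gapped_dipeptides(sequence, max_gap):
    """Count gapped-dipeptide terms in a protein sequence.

    A term "X<gap>Y" means residue X followed by residue Y with `gap`
    intervening positions (hypothetical naming; gap runs from 1 to max_gap).
    Raw counts are returned; PSSM-based weighting is not modeled here.
    """
    counts = Counter()
    for gap in range(1, max_gap + 1):
        # The second residue sits gap + 1 positions after the first.
        for i in range(len(sequence) - gap - 1):
            term = f"{sequence[i]}{gap}{sequence[i + gap + 1]}"
            counts[term] += 1
    return counts


def predict_with_threshold(site_probs, threshold):
    """Return the most probable site, or None if it is below the threshold.

    `site_probs` maps a localization site name to the probability reported
    by that site's one-versus-rest classifier (illustrative input only).
    """
    best_site = max(site_probs, key=site_probs.get)
    if site_probs[best_site] >= threshold:
        return best_site
    return None


if __name__ == "__main__":
    print(gapped_dipeptides("MKTA", max_gap=2))
    print(predict_with_threshold({"cytoplasmic": 0.82, "periplasmic": 0.10}, 0.7))
```

Returning `None` below the threshold corresponds to declining to predict, which is how a threshold trades recall for precision.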
Pages: 693-710
Page count: 18
  • [10] Cuff JA, 1999, PROTEINS, V34, P508, DOI 10.1002/(SICI)1097-0134(19990301)34:4<508::AID-PROT10>3.0.CO