R3P-Loc: A compact multi-label predictor using ridge regression and random projection for protein subcellular localization

Cited by: 30
Authors
Wan, Shibiao [1]
Mak, Man-Wai [1]
Kung, Sun-Yuan [2]
Affiliations
[1] Hong Kong Polytech Univ, Dept Elect & Informat Engn, Hong Kong, Hong Kong, Peoples R China
[2] Princeton Univ, Dept Elect Engn, Princeton, NJ 08544 USA
Keywords
Multi-location proteins; Compact databases; Multi-label classification; AMINO-ACID-COMPOSITION; GENE ONTOLOGY; JOHNSON-LINDENSTRAUSS; LEARNING CLASSIFIER; LOCATION; SINGLE; PSEAAC; DATABASE; SITES; PLANT;
DOI
10.1016/j.jtbi.2014.06.031
Chinese Library Classification
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Locating proteins within cellular contexts is of paramount significance in elucidating their biological functions. Computational methods based on knowledge databases (such as the gene ontology annotation (GOA) database) are known to be more efficient than sequence-based methods. However, knowledge-based methods typically face three problems: (1) knowledge databases are enormous and growing exponentially, (2) knowledge databases contain redundant information, and (3) the number of features extracted from knowledge databases is much larger than the number of data samples with ground-truth labels. These properties make the extracted features liable to contain redundant or irrelevant information, causing the prediction systems to suffer from overfitting. To address these problems, this paper proposes an efficient multi-label predictor, namely R3P-Loc, which uses two compact databases for feature extraction and applies random projection (RP) to reduce the feature dimensions of an ensemble ridge regression (RR) classifier. Two new compact databases are created from the Swiss-Prot and GOA databases. These databases possess almost the same amount of information as their full-size counterparts but are much smaller. Experimental results on two recent datasets (eukaryote and plant) suggest that R3P-Loc can reduce the dimensions seven-fold and significantly outperforms state-of-the-art predictors. This paper also demonstrates that the compact databases reduce the memory consumption by 39 times without degrading prediction accuracy. For readers' convenience, the R3P-Loc server is available online at http://bioinfo.eie.polyu.edu.hk/R3PLocServer/. © 2014 Elsevier Ltd. All rights reserved.
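The core idea in the abstract, random projection feeding an ensemble of ridge regression classifiers for multi-label prediction, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the sparse projection entries (in the style of Achlioptas's database-friendly projections), the zero decision threshold, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def random_projection_matrix(n_in, n_out, rng):
    """Sparse random projection (assumed Achlioptas-style entries).

    Entries are +1, 0, -1 with probabilities 1/6, 2/3, 1/6,
    scaled so that squared norms are preserved in expectation.
    """
    vals = rng.choice([-1.0, 0.0, 1.0], size=(n_out, n_in), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0 / n_out) * vals


def fit_ridge(X, Y, rho):
    """Closed-form ridge regression: W = (X^T X + rho I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + rho * np.eye(d), X.T @ Y)


def fit_ensemble(X, Y, n_out, n_models, rho=1.0, seed=0):
    """Train one ridge model per independent random projection."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        R = random_projection_matrix(X.shape[1], n_out, rng)
        W = fit_ridge(X @ R.T, Y, rho)  # fit in the reduced space
        models.append((R, W))
    return models


def predict(models, X):
    """Average the ensemble scores, then assign every label scoring above 0.

    If no label passes the threshold, the top-scoring label is assigned,
    so that each protein receives at least one location (an assumed rule).
    """
    scores = np.mean([X @ R.T @ W for R, W in models], axis=0)
    labels = scores > 0
    empty = np.flatnonzero(~labels.any(axis=1))
    labels[empty, scores[empty].argmax(axis=1)] = True
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = (rng.random((40, 700)) < 0.05).astype(float)  # toy sparse feature vectors
    Y = np.where(rng.random((40, 3)) < 0.4, 1.0, -1.0)  # toy +/-1 multi-label targets
    models = fit_ensemble(X, Y, n_out=100, n_models=5)  # 700 -> 100 dimensions
    print(predict(models, X[:5]).astype(int))
```

Targets are encoded as +1/-1 per location, so a protein with several positive entries is a multi-location protein. Averaging scores over several independent projections is one simple way to reduce the variance that any single random projection introduces.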
Pages: 34-45
Page count: 12