Prediction of Lysine Ubiquitylation with Ensemble Classifier and Feature Selection

Cited by: 47
Authors
Zhao, Xiaowei [1 ,2 ]
Li, Xiangtao [1 ,2 ]
Ma, Zhiqiang [1 ,2 ]
Yin, Minghao [2 ]
Affiliations
[1] NE Normal Univ, Coll Life Sci, Changchun 130024, Peoples R China
[2] NE Normal Univ, Coll Comp Sci, Changchun 130117, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
ubiquitylation; ensemble classifier; support vector machine; lysine ubiquitylation sites; UBIQUITIN-LIKE PROTEINS; PROTEOMICS APPROACH; INTRINSIC DISORDER; IDENTIFICATION; RELEVANCE; LOCATION;
DOI
10.3390/ijms12128347
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Ubiquitylation is an important post-translational modification. Correct identification of protein lysine ubiquitylation sites is fundamental to understanding the molecular mechanism of lysine ubiquitylation in biological systems. This paper develops a novel ensemble-based computational method to identify lysine ubiquitylation sites. In the proposed method, 468 ubiquitylation sites from 323 proteins retrieved from the Swiss-Prot database were encoded into feature vectors using four kinds of protein sequence information. A feature selection method was then applied to extract informative feature subsets. Different feature subsets, obtained by setting different starting points in the search procedure, were used to train multiple random forest classifiers, which were then aggregated into a consensus classifier by majority voting. Evaluated by jackknife tests and independent tests respectively, the proposed predictor reached an accuracy of 76.82% on the training dataset and 79.16% on the test dataset, indicating that it is a useful tool for predicting lysine ubiquitylation sites. Furthermore, site-specific feature analysis showed that ubiquitylation is closely correlated with the features of the surrounding sites in addition to features derived from the lysine site itself. The feature selection method is available upon request.
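The aggregation step described in the abstract (several classifiers, each trained on its own feature subset, combined by majority voting) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `select`, `majority_vote`, and `members` are hypothetical, and the trained random forests are replaced by toy threshold rules so the example runs with the standard library alone.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# A base classifier is any callable mapping a feature vector to a label
# (1 = ubiquitylated, 0 = not). In the paper these would be trained random
# forests; here they are hypothetical stand-ins.
Classifier = Callable[[Sequence[float]], int]
Member = Tuple[Sequence[int], Classifier]


def select(features: Sequence[float], subset: Sequence[int]) -> List[float]:
    """Project a full feature vector onto one selected feature subset."""
    return [features[i] for i in subset]


def majority_vote(members: Sequence[Member], features: Sequence[float]) -> int:
    """Return the label predicted by most ensemble members.

    Each member is a (feature_subset, classifier) pair, and the classifier
    only sees the feature columns listed in its own subset.
    """
    votes = Counter(clf(select(features, subset)) for subset, clf in members)
    return votes.most_common(1)[0][0]


# Three toy members; an odd count avoids ties for a binary label.
members: List[Member] = [
    ((0, 1), lambda x: int(x[0] + x[1] > 1.0)),
    ((1, 2), lambda x: int(x[0] > 0.5)),
    ((0, 2), lambda x: int(x[1] > 0.9)),
]

print(majority_vote(members, [0.8, 0.6, 0.2]))  # two of three members vote 1
```

Keeping each feature subset paired with its classifier means the ensemble can be queried with the full feature vector while every member still sees only the columns it was trained on.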
Pages: 8347-8361
Number of pages: 15