Protein sumoylation sites prediction based on two-stage feature selection

Cited by: 27
Authors
Lu, Lin [3]
Shi, Xiao-He [5]
Li, Su-Jun [7]
Xie, Zhi-Qun [1]
Feng, Yong-Li [6]
Lu, Wen-Cong [6]
Li, Yi-Xue [4, 7]
Li, Haipeng [1]
Cai, Yu-Dong [1, 2]
Affiliations
[1] Chinese Acad Sci, Shanghai Inst Biol Sci, MPG Partner Inst Computat Biol, Shanghai 200031, Peoples R China
[2] Shanghai Univ, Inst Syst Biol, Shanghai 200244, Peoples R China
[3] Shanghai Jiao Tong Univ, Dept Biomed Engn, Shanghai 200240, Peoples R China
[4] Sch Shanghai Jiao Tong Univ, Shanghai 200240, Peoples R China
[5] Chinese Acad Sci, Shanghai Inst Biol Sci, Inst Hlth Sci, Shanghai 200025, Peoples R China
[6] Coll Sci, Dept Chem, Shanghai 200444, Peoples R China
[7] Chinese Acad Sci, Shanghai Inst Biol Sci, Key Lab Syst Biol, Shanghai 200031, Peoples R China
Keywords
Prediction; Protein sumoylation; mRMR; AAIndex; Nearest Neighbor Algorithm; Leave-one-out cross-validation; Bioinformatics; ACID INDEX DATABASE; SUMO; CONJUGATION; AAINDEX; UBC9;
DOI
10.1007/s11030-009-9149-5
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Protein sumoylation is one of the most important post-translational modifications, and accurate prediction of sumoylation sites is very useful for proteome analysis. Although the putative motif ΨKXE can be used, optimizing prediction models remains a challenge. In this study, we developed a prediction system based on a feature selection strategy. A total of 1,272 peptides of 14 residues from SUMOsp (Xue et al. [8] Nucleic Acids Res 34:W254-W257, 2006) were investigated, including 212 substrates and 1,060 non-substrates. Among the substrates, only 162 comply with the motif ΨKXE. First, the 1,272 peptides were divided into a training set and a test set. All the peptides were encoded into feature vectors using hundreds of amino acid properties collected in the Amino Acid Index Database (AAIndex, http://www.genome.jp/aaindex). Then, the mRMR (minimum redundancy-maximum relevance) method was applied to extract the most informative features. Finally, the Nearest Neighbor Algorithm (NNA) was used to produce the prediction models. Tested by leave-one-out (LOO) cross-validation, the optimal prediction model reaches an accuracy of 84.4% for the training set and 76.4% for the test set. Notably, 180 substrates were correctly predicted, 18 more than with the motif ΨKXE. The final selected features indicate that the amino acid residues two positions downstream and one position upstream of the sumoylation sites play the most important role in determining the occurrence of sumoylation. Based on the feature selection strategy, our prediction system can be used not only for high-throughput prediction of sumoylation sites but also as a tool to investigate the mechanism of sumoylation.
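The workflow described in the abstract (property-based encoding, redundancy-aware feature ranking, nearest-neighbor classification, leave-one-out testing) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the two property scales are crude stand-ins for AAIndex entries, the ranking uses absolute Pearson correlation in place of the mutual-information criterion of mRMR, the peptides are synthetic 7-residue windows rather than the 14-residue SUMOsp peptides, and all function names are hypothetical.

```python
import math
import random

# Toy stand-ins for AAIndex properties (illustrative only, not real AAIndex entries).
HYDROPHOBIC = set("AILMFVW")
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}

def encode(peptide):
    """Concatenate two toy property values per residue into one feature vector."""
    vec = []
    for aa in peptide:
        vec.append(1.0 if aa in HYDROPHOBIC else 0.0)
        vec.append(CHARGE.get(aa, 0.0))
    return vec

def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)

def select_features(X, y, k):
    """Greedy max-relevance, min-redundancy ranking (correlation-based stand-in)."""
    cols = [list(c) for c in zip(*X)]
    relevance = [abs_corr(c, y) for c in cols]
    selected = []
    while len(selected) < k:
        best, best_score = None, -math.inf
        for j in range(len(cols)):
            if j in selected:
                continue
            # Redundancy: mean correlation with the features chosen so far.
            redundancy = 0.0
            if selected:
                redundancy = sum(abs_corr(cols[j], cols[s]) for s in selected) / len(selected)
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i in range(len(X)):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in features),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)

def toy_dataset(n_pos=20, n_neg=40, seed=0):
    """Synthetic 7-residue windows centered on a lysine (hypothetical data)."""
    rng = random.Random(seed)
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    peptides, labels = [], []
    for label, count in ((1, n_pos), (0, n_neg)):
        for _ in range(count):
            p = [rng.choice(alphabet) for _ in range(7)]
            p[3] = "K"  # candidate lysine at the center
            if label == 1:
                p[2] = rng.choice("ILVF")  # hydrophobic residue one position upstream
                p[5] = "E"                 # glutamate two positions downstream
            peptides.append("".join(p))
            labels.append(label)
    return peptides, labels

peptides, labels = toy_dataset()
X = [encode(p) for p in peptides]
chosen = select_features(X, labels, k=4)
accuracy = loo_accuracy(X, labels, chosen)
print(chosen, round(accuracy, 3))
```

Each selected index maps back to a residue position (index // 2) and a property (index % 2), which is how a ranking of this kind can point to the positions that matter most around the modified lysine.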
Pages: 81-86
Page count: 6
Related papers
50 in total
  • [21] A two-stage Markov blanket based feature selection algorithm for text classification
    Javed, Kashif
    Maruf, Sameen
    Babri, Haroon A.
    NEUROCOMPUTING, 2015, 157 : 91 - 104
  • [22] Prediction of protein amidation sites by feature selection and analysis
    Weiren Cui
    Shen Niu
    Lulu Zheng
    Lele Hu
    Tao Huang
    Lei Gu
    Kaiyan Feng
    Ning Zhang
    Yudong Cai
    Yixue Li
    Molecular Genetics and Genomics, 2013, 288 : 391 - 400
  • [23] Prediction of protein amidation sites by feature selection and analysis
    Cui, Weiren
    Niu, Shen
    Zheng, Lulu
    Hu, Lele
    Huang, Tao
    Gu, Lei
    Feng, Kaiyan
    Zhang, Ning
    Cai, Yudong
    Li, Yixue
    MOLECULAR GENETICS AND GENOMICS, 2013, 288 (09) : 391 - 400
  • [24] Two-stage variable selection for molecular prediction of disease
    Firouzi, Hamed
    Rajaratnam, Bala
    Hero, Alfred O., III
    2013 IEEE 5TH INTERNATIONAL WORKSHOP ON COMPUTATIONAL ADVANCES IN MULTI-SENSOR ADAPTIVE PROCESSING (CAMSAP 2013), 2013, : 169 - +
  • [25] A novel deep learning ensemble model based on two-stage feature selection and intelligent optimization for water quality prediction
    Liu, Wenli
    Liu, Tianxiang
    Liu, Zihan
    Luo, Hanbin
    Pei, Hanmin
    ENVIRONMENTAL RESEARCH, 2023, 224
  • [26] A two-stage hybrid approach for feature selection in microarray analysis
    Lee, Chung-Hong
    Yang, Hsin-Chang
    Wu, Chih-Hong
    Lan, Yi-Chia
    HIS 2009: 2009 NINTH INTERNATIONAL CONFERENCE ON HYBRID INTELLIGENT SYSTEMS, VOL 1, PROCEEDINGS, 2009, : 188 - +
  • [27] Two-stage classification with automatic feature selection for an industrial application
    Hader, S
    Hamprecht, FA
    Classification - the Ubiquitous Challenge, 2005, : 137 - 144
  • [28] A two-stage feature selection method for hob state recognition
    Jia, Yachao
    Li, Guolong
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 133
  • [29] A Two-Stage Feature Selection Method for Gene Expression Data
    Chuang, Li-Yeh
    Ke, Chao-Hsuan
    Chang, Hsueh-Wei
    Yang, Cheng-Hong
    OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY, 2009, 13 (02) : 127 - 137
  • [30] A Novel Two-Stage Selection of Feature Subsets in Machine Learning
    Kamala, F. Rosita
    Thangaiah, P. Ranjit Jeba
    ENGINEERING TECHNOLOGY & APPLIED SCIENCE RESEARCH, 2019, 9 (03) : 4169 - 4175