Protein sumoylation sites prediction based on two-stage feature selection

Cited by: 27
Authors
Lu, Lin [3]
Shi, Xiao-He [5]
Li, Su-Jun [7]
Xie, Zhi-Qun [1]
Feng, Yong-Li [6]
Lu, Wen-Cong [6]
Li, Yi-Xue [4, 7]
Li, Haipeng [1]
Cai, Yu-Dong [1, 2]
Affiliations
[1] Chinese Acad Sci, Shanghai Inst Biol Sci, MPG Partner Inst Computat Biol, Shanghai 200031, Peoples R China
[2] Shanghai Univ, Inst Syst Biol, Shanghai 200244, Peoples R China
[3] Shanghai Jiao Tong Univ, Dept Biomed Engn, Shanghai 200240, Peoples R China
[4] Sch Shanghai Jiao Tong Univ, Shanghai 200240, Peoples R China
[5] Chinese Acad Sci, Shanghai Inst Biol Sci, Inst Hlth Sci, Shanghai 200025, Peoples R China
[6] Coll Sci, Dept Chem, Shanghai 200444, Peoples R China
[7] Chinese Acad Sci, Shanghai Inst Biol Sci, Key Lab Syst Biol, Shanghai 200031, Peoples R China
Keywords
Prediction; Protein sumoylation; mRMR; AAIndex; Nearest Neighbor Algorithm; Leave-one-out cross-validation; Bioinformatics; ACID INDEX DATABASE; SUMO; CONJUGATION; AAINDEX; UBC9;
DOI
10.1007/s11030-009-9149-5
Chinese Library Classification (CLC)
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Protein sumoylation is one of the most important post-translational modifications, and accurate prediction of sumoylation sites is valuable for proteome analysis. Although the putative motif ΨKXE can be used, optimizing prediction models remains a challenge. In this study, we developed a prediction system based on a feature selection strategy. A total of 1,272 peptides of 14 residues each from SUMOsp (Xue et al. [8], Nucleic Acids Res 34:W254-W257, 2006) were investigated, comprising 212 substrates and 1,060 non-substrates. Among the substrates, only 162 conform to the motif ΨKXE. First, the 1,272 peptides were divided into a training set and a test set. All peptides were encoded into feature vectors using hundreds of amino acid properties collected in the Amino Acid Index Database (AAIndex, http://www.genome.jp/aaindex). Then, the mRMR (minimum redundancy-maximum relevance) method was applied to extract the most informative features. Finally, the Nearest Neighbor Algorithm (NNA) was used to build the prediction models. Evaluated by leave-one-out (LOO) cross-validation, the optimal prediction model reaches an accuracy of 84.4% for the training set and 76.4% for the test set. Notably, 180 substrates were correctly predicted, 18 more than with the motif ΨKXE alone. The final selected features indicate that the residues two positions downstream and one position upstream of the sumoylation site play the most important role in determining whether sumoylation occurs. Based on the feature selection strategy, our prediction system can be used not only for high-throughput prediction of sumoylation sites but also as a tool to investigate the mechanism of sumoylation.
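The pipeline named in the abstract (property-based encoding, mRMR selection, nearest-neighbor classification, leave-one-out evaluation) can be illustrated with a minimal sketch. This is not the authors' implementation. The property table `TOY_PROPS`, the 4-residue toy peptides, the difference form of the mRMR score, and the squared Euclidean distance are all assumptions made for illustration; real AAIndex values are continuous and would need to be discretized before the mutual information step.

```python
import math
from collections import Counter

# Hypothetical property table: illustrative values, NOT real AAIndex entries.
TOY_PROPS = {
    "charge": {"A": 0.0, "I": 0.0, "L": 0.0, "V": 0.0,
               "K": 1.0, "E": -1.0, "G": 0.0, "S": 0.0},
    "hydrophobic": {"A": 1.0, "I": 1.0, "L": 1.0, "V": 1.0,
                    "K": 0.0, "E": 0.0, "G": 0.0, "S": 0.0},
}

def encode(peptide):
    """Concatenate every property value for every residue position."""
    return [TOY_PROPS[p][aa] for aa in peptide for p in sorted(TOY_PROPS)]

def mutual_info(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def mrmr(X, y, k):
    """Greedy mRMR: maximize relevance minus mean redundancy (assumed form)."""
    cols = list(zip(*X))
    relevance = [mutual_info(col, y) for col in cols]
    selected = []
    while len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(mutual_info(cols[j], cols[s])
                             for s in selected) / len(selected)
            return relevance[j] - redundancy
        candidates = [j for j in range(len(cols)) if j not in selected]
        selected.append(max(candidates, key=score))
    return selected

def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    hits = 0
    for i, xi in enumerate(X):
        # Squared Euclidean distance on the selected features (an assumption).
        dists = [(sum((xi[f] - xj[f]) ** 2 for f in features), j)
                 for j, xj in enumerate(X) if j != i]
        nearest = min(dists)[1]
        hits += y[nearest] == y[i]
    return hits / len(X)

# Toy data: four "substrate" and four "non-substrate" peptides (made up).
peptides = ["IKAE", "VKSE", "LKGE", "IKSE", "GKAS", "SKGA", "AKSG", "GKSA"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
X = [encode(p) for p in peptides]
selected = mrmr(X, labels, k=2)
print("selected feature indices:", selected)
print("LOO accuracy:", loo_accuracy(X, labels, selected))
```

In this sketch the selected feature indices map back to a (position, property) pair, which is how a feature selection step can point to the residue positions that matter most for a prediction.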
Pages: 81-86
Number of pages: 6