Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naive Bayes

Cited by: 140
Authors
Lou, Wangchao [1]
Wang, Xiaoqing [1]
Chen, Fan [1]
Chen, Yixiao [1]
Jiang, Bo [1]
Zhang, Hua [1]
Affiliations
[1] Zhejiang Gongshang Univ, Sch Comp & Informat Engn, Hangzhou, Zhejiang, Peoples R China
Source
PLOS ONE | 2014 / Vol. 9 / Issue 01
Funding
National Natural Science Foundation of China
Keywords
RIBOSOMAL-RNA-BINDING; SECONDARY STRUCTURE; EVOLUTIONARY CONSERVATION; FOLD RECOGNITION; IDENTIFICATION; COVARIANCE; RESOLUTION; ACCURATE; RECEPTORS; DOMAINS;
DOI
10.1371/journal.pone.0086703
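Chinese Library Classification number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification code
07; 0710; 09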
Abstract
Because DNA-binding proteins play vital roles in gene regulation, an efficient method for identifying them is highly desirable and would advance our understanding of protein function. In this study, we propose a new method for predicting DNA-binding proteins that ranks features using random forest and then performs wrapper-based feature selection with a forward best-first search strategy. The features comprise information from the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naive Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function kernel. DBPPred yields the highest average accuracy (0.791) and average MCC (0.583) in five-fold cross-validation with ten runs on the training benchmark dataset PDB594. In blind tests on the independent dataset PDB186, the model trained on the entire PDB594 dataset was compared with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND, and DBD-Threader), and DBPPred yielded the highest accuracy (0.769), MCC (0.538), and AUC (0.790). Independent tests of DBPPred on a large non-DNA-binding protein dataset and two RNA-binding protein datasets also showed improved or comparable quality relative to the relevant prediction methods. Moreover, we observed that for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between DNA-binding and non-DNA-binding proteins.
Together, these experimental results indicate that DBPPred can serve as a prospective alternative predictor for large-scale identification of DNA-binding proteins.
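The abstract reports performance as accuracy and MCC (Matthews correlation coefficient). As a minimal sketch of how these two metrics are conventionally computed from a binary confusion matrix, the snippet below uses the standard textbook definitions. The function names `accuracy` and `mcc` and the example counts are illustrative assumptions, not taken from the paper or from the DBPPred software.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    Returns 0.0 when the denominator is zero, a common convention.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts chosen for illustration only; not results from the paper.
    tp, tn, fp, fn = 40, 45, 5, 10
    print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
    print(f"MCC      = {mcc(tp, tn, fp, fn):.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix and is less affected by class imbalance, which may be one reason it is commonly reported alongside accuracy for this kind of binary classification task.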
Pages: 10
Related papers
50 in total
  • [31] DBPboost: A method of classification of DNA-binding proteins based on improved differential evolution algorithm and feature extraction
    Sun, Ailun
    Li, Hongfei
    Dong, Guanghui
    Zhao, Yuming
    Zhang, Dandan
    METHODS, 2024, 223 : 56 - 64
  • [32] FC-SVM: DNA binding Proteins prediction with Average Blocks (AB) descriptors using SVM with FC feature Selection
    Ridok, Achmad
    Widodo, Nashi
    Mahmudy, Wayan Firdaus
    Rifa'i, Muhaimin
    PROCEEDINGS OF 2019 4TH INTERNATIONAL CONFERENCE ON SUSTAINABLE INFORMATION ENGINEERING AND TECHNOLOGY (SIET 2019), 2019, : 22 - 27
  • [33] DNA-binding protein prediction based on deep transfer learning
    Yan, Jun
    Jiang, Tengsheng
    Liu, Junkai
    Lu, Yaoyao
    Guan, Shixuan
    Li, Haiou
    Wu, Hongjie
    Ding, Yijie
    MATHEMATICAL BIOSCIENCES AND ENGINEERING, 2022, 19 (08) : 7719 - 7736
  • [34] A DNA-BINDING PROTEINS PREDICTION MODEL USING DIFFERENT PROPERTY DISTANCE TRANSFORMATION
    Li, Xiangyu
    Yang, Lina
    Tang, Yuan Yan
    Wang, Patrick
    PROCEEDINGS OF 2020 INTERNATIONAL CONFERENCE ON WAVELET ANALYSIS AND PATTERN RECOGNITION (ICWAPR), 2020, : 8 - 13
  • [35] iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model
    Lin, Wei-Zhong
    Fang, Jian-An
    Xiao, Xuan
    Chou, Kuo-Chen
    PLOS ONE, 2011, 6 (09):
  • [36] Using hidden Markov models to predict DNA-binding proteins with sequence and structure information
    Hsu, Yi-Yu
    Chen, Wei-Jhih
    Chen, Shu-Hui
    Kao, Hung-Yu
    SOFT COMPUTING, 2014, 18 (12) : 2365 - 2376
  • [37] StackPDB: Predicting DNA-binding proteins based on XGB-RFE feature optimization and stacked ensemble classifier
    Zhang, Qingmei
    Liu, Peishun
    Wang, Xue
    Zhang, Yaqun
    Han, Yu
    Yu, Bin
    APPLIED SOFT COMPUTING, 2021, 99
  • [38] PseKNC and Adaboost-Based Method for DNA-Binding Proteins Recognition
    Yang, Lina
    Li, Xiangyu
    Shu, Ting
    Wang, Patrick
    Li, Xichun
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2021, 35 (07)
  • [39] AIPpred: Sequence-Based Prediction of Anti-inflammatory Peptides Using Random Forest
    Manavalan, Balachandran
    Shin, Tae H.
    Kim, Myeong O.
    Lee, Gwang
    FRONTIERS IN PHARMACOLOGY, 2018, 9
  • [40] Structure-based prediction of protein-peptide binding regions using Random Forest
    Taherzadeh, Ghazaleh
    Zhou, Yaoqi
    Liew, Alan Wee-Chung
    Yang, Yuedong
    BIOINFORMATICS, 2018, 34 (03) : 477 - 484