Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naive Bayes

Cited by: 140
Authors
Lou, Wangchao [1]
Wang, Xiaoqing [1]
Chen, Fan [1]
Chen, Yixiao [1]
Jiang, Bo [1]
Zhang, Hua [1]
Affiliation
[1] Zhejiang Gongshang Univ, Sch Comp & Informat Engn, Hangzhou, Zhejiang, Peoples R China
Source
PLOS ONE | 2014 | Volume 9 | Issue 01
Funding
National Natural Science Foundation of China
Keywords
RIBOSOMAL-RNA-BINDING; SECONDARY STRUCTURE; EVOLUTIONARY CONSERVATION; FOLD RECOGNITION; IDENTIFICATION; COVARIANCE; RESOLUTION; ACCURATE; RECEPTORS; DOMAINS;
DOI
10.1371/journal.pone.0086703
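Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification codes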
07 ; 0710 ; 09 ;
Abstract
Developing an efficient method for identifying DNA-binding proteins is highly desirable: these proteins play vital roles in gene regulation, and such a method would advance our understanding of protein function. In this study, we propose a new method for predicting DNA-binding proteins that ranks features using random forest and then applies wrapper-based feature selection with a forward best-first search strategy. The features comprise information from the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naive Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function kernel. DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 in five-fold cross-validation with ten runs on the training benchmark dataset PDB594. In subsequent blind tests on the independent dataset PDB186, the model trained on the entire PDB594 dataset was compared with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND, and DBD-Threader); DBPPred yielded the highest accuracy (0.769), MCC (0.538), and AUC (0.790). Independent tests of DBPPred on a large dataset composed entirely of non-DNA-binding proteins and on two RNA-binding protein datasets also showed improved or comparable quality relative to the relevant prediction methods. Moreover, for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between the DNA-binding and the non-DNA-binding proteins.
Taken together, the experimental results indicate that DBPPred can serve as a promising alternative predictor for large-scale determination of DNA-binding proteins.
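
The selection scheme described in the abstract (a ranked feature list fed to a forward wrapper search around a Gaussian naive Bayes classifier, scored by cross-validated MCC) can be illustrated with a minimal sketch. This is not the DBPPred implementation. It assumes the feature ranking is already available (in the paper it comes from random forest; here `ranked` is a hypothetical hand-written list), it replaces best-first search with a simpler single greedy forward pass, and it runs on synthetic toy data rather than PDB594. All function and variable names are illustrative.

```python
import math
import random
from statistics import mean, pvariance


def fit_gnb(X, y, cols):
    """Estimate a class prior plus per-feature mean and variance on `cols`."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        stats = []
        for c in cols:
            vals = [r[c] for r in rows]
            stats.append((mean(vals), pvariance(vals) + 1e-9))  # floor avoids zero variance
        model[label] = (len(rows) / len(X), stats)
    return model


def predict_gnb(model, x, cols):
    """Return the label with the highest Gaussian naive Bayes log-posterior."""
    def log_post(label):
        prior, stats = model[label]
        lp = math.log(prior)
        for c, (mu, var) in zip(cols, stats):
            lp += -0.5 * math.log(2 * math.pi * var) - (x[c] - mu) ** 2 / (2 * var)
        return lp
    return max(model, key=log_post)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cv_mcc(X, y, cols, k=5):
    """k-fold cross-validated MCC; fold f holds out samples with index % k == f."""
    y_pred = [None] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        model = fit_gnb([X[i] for i in train], [y[i] for i in train], cols)
        for i in range(fold, len(X), k):
            y_pred[i] = predict_gnb(model, X[i], cols)
    return mcc(y, y_pred)


def forward_select(X, y, ranked):
    """Walk the ranked features; keep one only if it improves the CV MCC."""
    selected, best = [], -1.0
    for c in ranked:
        score = cv_mcc(X, y, selected + [c])
        if score > best:
            selected, best = selected + [c], score
    return selected, best


# Synthetic toy data (assumption): feature 0 is strongly informative,
# feature 1 is pure noise, feature 2 is weakly informative.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(4.0 * t, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.5 * t, 1.0)] for t in y]

ranked = [0, 2, 1]  # hypothetical ranking standing in for random forest importances
selected, best = forward_select(X, y, ranked)
print("selected features:", selected, "CV MCC:", round(best, 3))
```

A wrapper search of this kind scores each candidate subset with the classifier itself, so the retained features are those that help Gaussian naive Bayes specifically, not merely those that rank highly on their own.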
Pages: 10
Related papers
50 in total
  • [21] Sequence Based Prediction of Antioxidant Proteins Using a Classifier Selection Strategy
    Zhang, Lina
    Zhang, Chengjin
    Gao, Rui
    Yang, Runtao
    Song, Qing
    PLOS ONE, 2016, 11 (09)
  • [22] Comparative Analysis on Alignment-Based and Pretrained Feature Representations for the Identification of DNA-Binding Proteins
    Chen, Die
    Zhang, Hua
    Chen, Zeqi
    Xie, Bo
    Wang, Ye
    COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE, 2022, 2022
  • [23] StackDPP: a stacking ensemble based DNA-binding protein prediction model
    Ahmed, Sheikh Hasib
    Bose, Dibyendu Brinto
    Khandoker, Rafi
    Rahman, M. Saifur
    BMC BIOINFORMATICS, 2024, 25 (01)
  • [24] Predicting a DNA-binding protein using random forest with multiple mathematical features
    Guan, Changge
    Niu, Xiaohui
    Shi, Feng
    Yang, Kun
    Li, Nana
    BIO-MEDICAL MATERIALS AND ENGINEERING, 2015, 26 : S1883 - S1889
  • [25] DBP-DeepCNN: Prediction of DNA-binding proteins using wavelet-based denoising and deep learning
    Ali, Farman
    Kumar, Harish
    Patil, Shruti
    Ahmed, Aftab
    Banjar, Ameen
    Daud, Ali
    CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2022, 229
  • [26] The Prediction of Intrinsically Disordered Proteins Based on Feature Selection
    He, Hao
    Zhao, Jiaxiang
    Sun, Guiling
    ALGORITHMS, 2019, 12 (02)
  • [27] Random Fourier features-based sparse representation classifier for identifying DNA-binding proteins
    Guo, Xiaoyi
    Tiwari, Prayag
    Zhang, Ying
    Han, Shuguang
    Wang, Yansu
    Ding, Yijie
    COMPUTERS IN BIOLOGY AND MEDICINE, 2022, 151
  • [28] UMAP-DBP: An Improved DNA-Binding Proteins Prediction Method Based on Uniform Manifold Approximation and Projection
    Wang, Jinyue
    Zhang, Shengli
    Qiao, Huijuan
    Wang, Jiesheng
    PROTEIN JOURNAL, 2021, 40 (04) : 562 - 575
  • [29] HMMBinder: DNA-Binding Protein Prediction Using HMM Profile Based Features
    Zaman, Rianon
    Chowdhury, Shahana Yasmin
    Rashid, Mahmood A.
    Sharma, Alok
    Dehzangi, Abdollah
    Shatabda, Swakkhar
    BIOMED RESEARCH INTERNATIONAL, 2017, 2017
  • [30] Identification of DNA-Binding Proteins Using Support Vector Machine with Sequence Information
    Ma, Xin
    Wu, Jiansheng
    Xue, Xiaoyun
    COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE, 2013, 2013