Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naive Bayes

Cited by: 140
Authors
Lou, Wangchao [1]
Wang, Xiaoqing [1]
Chen, Fan [1]
Chen, Yixiao [1]
Jiang, Bo [1]
Zhang, Hua [1]
Affiliations
[1] Zhejiang Gongshang Univ, Sch Comp & Informat Engn, Hangzhou, Zhejiang, Peoples R China
Source
PLOS ONE | 2014 / Vol. 9 / Issue 01
Funding
National Natural Science Foundation of China
Keywords
RIBOSOMAL-RNA-BINDING; SECONDARY STRUCTURE; EVOLUTIONARY CONSERVATION; FOLD RECOGNITION; IDENTIFICATION; COVARIANCE; RESOLUTION; ACCURATE; RECEPTORS; DOMAINS;
DOI
10.1371/journal.pone.0086703
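Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract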
Because DNA-binding proteins play vital roles in gene regulation, an efficient method for identifying them is highly desirable and would advance our understanding of protein function. In this study, we propose a new method for predicting DNA-binding proteins that ranks features with random forest and then performs wrapper-based feature selection with a forward best-first search strategy. The features comprise information from the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naive Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function kernel. DBPPred achieves the highest average accuracy (0.791) and average MCC (0.583) in ten runs of five-fold cross-validation on the training benchmark dataset PDB594. In blind tests on the independent dataset PDB186, the model trained on the entire PDB594 dataset was compared with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND, and DBD-Threader) and yielded the highest accuracy (0.769), MCC (0.538), and AUC (0.790). Independent tests of DBPPred on a large dataset of non-DNA-binding proteins and on two RNA-binding protein datasets also showed improved or comparable quality relative to the relevant prediction methods. Moreover, for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between the DNA-binding and the non-DNA-binding proteins. Together, the experimental results indicate that DBPPred can serve as a prospective alternative predictor for large-scale identification of DNA-binding proteins.
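The wrapper idea in the abstract (take features in ranked order and keep one only if it improves the cross-validated score of a Gaussian naive Bayes classifier) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation. It assumes the ranking is already available (in the paper it comes from random forest; here `ranked` is a hypothetical hand-written stand-in), it replaces the forward best-first search with a simpler single greedy pass, and it runs on small synthetic data. All function names are illustrative.

```python
import math
from statistics import mean, pvariance


def fit_gnb(X, y, cols):
    """Fit Gaussian naive Bayes on the chosen feature columns."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        # Per-feature mean and variance; a tiny constant avoids zero variance.
        stats = [(mean(r[c] for r in rows), pvariance([r[c] for r in rows]) + 1e-9)
                 for c in cols]
        model[label] = (math.log(len(rows) / len(y)), stats)
    return model


def predict_gnb(model, x, cols):
    """Return the label with the highest log posterior for sample x."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, stats) in model.items():
        score = log_prior
        for c, (mu, var) in zip(cols, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x[c] - mu) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cv_accuracy(X, y, cols, k=5):
    """k-fold cross-validated accuracy with interleaved folds (sample i goes to fold i % k)."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = fit_gnb([X[i] for i in train], [y[i] for i in train], cols)
        correct += sum(predict_gnb(model, X[i], cols) == y[i] for i in test)
    return correct / len(X)


def forward_select(X, y, ranked, k=5):
    """Greedy forward pass: keep a ranked feature only if CV accuracy strictly improves.

    Simplification: the paper describes a forward best-first search; this
    single pass never backtracks.
    """
    selected, best = [], 0.0
    for f in ranked:
        acc = cv_accuracy(X, y, selected + [f], k)
        if acc > best:
            selected, best = selected + [f], acc
    return selected, best


# Synthetic data (not from the paper): feature 0 separates the two classes,
# features 1 and 2 are unrelated to the label.
y = [i % 2 for i in range(40)]
X = [[2.0 * y[i] + 0.1 * (i % 7), float(i % 3), float((i * 7) % 5)] for i in range(40)]

ranked = [0, 1, 2]  # hypothetical ranking, standing in for random forest importance
selected, best = forward_select(X, y, ranked)
print(selected, best)
```

On this toy data the informative feature alone already classifies every held-out sample correctly, so the two noise features are never added. With real feature sets the ranking step matters because it determines the order in which candidates are tried.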
Pages: 10
Related papers
50 in total
  • [41] DRBPPred-GAT: Accurate prediction of DNA-binding proteins and RNA-binding proteins based on graph multi-head attention network
    Zhang, Xinyu
    Wang, Yifei
    Wei, Qinqin
    He, Shiyue
    Salhi, Adil
    Yu, Bin
    KNOWLEDGE-BASED SYSTEMS, 2024, 285
  • [42] SVM-based Prediction of the Calpain Degradome using Bayes Feature Extraction
    Wee, L. J. K.
    Low, H. M.
    2012 ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY (EMBC), 2012, : 5534 - 5540
  • [43] Prediction of cell penetrating peptides and their uptake efficiency using random forest-based feature selections
    Liu, Peng
    Ding, Yijie
    Rong, Ying
    Chen, Dong
    AICHE JOURNAL, 2022, 68 (09)
  • [44] iDRPro-SC: identifying DNA-binding proteins and RNA-binding proteins based on subfunction classifiers
    Yan, Ke
    Feng, Jiawei
    Huang, Jing
    Wu, Hao
    BRIEFINGS IN BIOINFORMATICS, 2023, 24 (04)
  • [45] Predicting DNA-binding protein and coronavirus protein flexibility using protein dihedral angle and sequence feature
    Wang, Wei
    Su, Xili
    Liu, Dong
    Zhang, Hongjun
    Wang, Xianfang
    Zhou, Yun
    PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2023, 91 (04) : 497 - 507
  • [46] Improving DNA-Binding Protein Prediction Using Three-Part Sequence-Order Feature Extraction and a Deep Neural Network Algorithm
    Hu, Jun
    Zeng, Wen-Wu
    Jia, Ning-Xin
    Arif, Muhammad
    Yu, Dong-Jun
    Zhang, Gui-Jun
    JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2023, 63 (03) : 1044 - 1057
  • [47] EmbedCaps-DBP: Predicting DNA-Binding Proteins Using Protein Sequence Embedding and Capsule Network
    Naim, Muhammad Khaerul
    Mengko, Tati Rajab
    Hertadi, Rukman
    Purwarianti, Ayu
    Susanty, Meredita
    IEEE ACCESS, 2023, 11 : 121256 - 121268
  • [48] A Novel Sequence-Based Method of Predicting Protein DNA-Binding Residues, Using a Machine Learning Approach
    Cai, Yudong
    He, ZhiSong
    Shi, Xiaohe
    Kong, Xiangying
    Gu, Lei
    Xie, Lu
    MOLECULES AND CELLS, 2010, 30 (02) : 99 - 105
  • [49] A Model-Free Feature Selection Technique of Feature Screening and Random Forest-Based Recursive Feature Elimination
    Xia, Siwei
    Yang, Yuehan
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2023, 2023
  • [50] DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches
    Liu, Rong
    Hu, Jianjun
    PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2013, 81 (11) : 1885 - 1899