An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis

Cited by: 64
Authors
Zou, Chuanxin [1]
Gong, Jiayu [1]
Li, Honglin [1]
Affiliations
[1] E China Univ Sci & Technol, Shanghai Key Lab New Drug Design, State Key Lab Bioreactor Engn, Sch Pharm, Shanghai 200237, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
SUPPORT VECTOR MACHINES; SUBCELLULAR-LOCALIZATION; WEB SERVER; DIPEPTIDE COMPOSITION; SECONDARY STRUCTURE; RNA-BINDING; ACID; IDENTIFICATION; SITES; REPRESENTATION;
DOI
10.1186/1471-2105-14-90
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: DNA-binding proteins (DNA-BPs) play a pivotal role in both eukaryotic and prokaryotic proteomes. Several computational methods for identifying DNA-BPs have been proposed in the literature, and many informative features and properties have been used and shown to have a significant impact on this problem. However, the ultimate goal of bioinformatics is to predict DNA-BPs directly from primary sequence. Results: This work focuses on how to transform these informative features into a uniform numeric representation and thereby improve the prediction accuracy of our SVM-based classifier for DNA-BPs. A systematic representation of selected features known to perform well is investigated. First, four kinds of protein properties are obtained and used to describe the protein sequence. Second, three feature transformation methods (OCTD, AC and SAA) are adopted to obtain numeric feature vectors at three levels of the protein sequence (global, nonlocal and local), and their performance is exhaustively investigated. Finally, the mRMR-IFS feature selection method and an ensemble learning approach are used to determine the best prediction model. In addition, the optimal features selected by mRMR-IFS are illustrated based on the observed results, which may provide useful insights into the mechanisms of protein-DNA interactions. In five-fold cross-validation on DNAdset and DNAaset, we obtained overall accuracies of 0.940 and 0.811 and MCCs of 0.881 and 0.614, respectively. Conclusions: These results suggest that an entirely sequence-based protocol, which transforms and integrates informative features from different scales for use by an SVM, can be developed efficiently and can predict DNA-BPs accurately. Moreover, a novel systematic framework for sequence descriptor-based protein function prediction is proposed.
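The abstract reports overall accuracy and MCC (Matthews correlation coefficient) for a binary DNA-BP versus non-DNA-BP classifier. As a minimal sketch of how these two metrics are conventionally computed from a confusion matrix, the Python below may help. The function name `accuracy_and_mcc` and the counts in the usage lines are hypothetical illustrations, not values or code from the paper.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary confusion matrix.

    tp/tn/fp/fn are counts of true positives, true negatives,
    false positives and false negatives (hypothetical inputs here).
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    # Accuracy: fraction of all predictions that are correct.
    accuracy = (tp + tn) / total

    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Illustrative counts only; not taken from DNAdset or DNAaset.
acc, mcc = accuracy_and_mcc(tp=45, tn=40, fp=10, fn=5)
print(f"accuracy={acc:.3f}, MCC={mcc:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes differ in size.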
Pages: 14