Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation

Cited by: 72
Authors
Xu, Ruifeng [1, 2]
Zhou, Jiyun [1]
Wang, Hongpeng [1]
He, Yulan [3]
Wang, Xiaolong [1, 2]
Liu, Bin [1, 2]
Affiliations
[1] Harbin Inst Technol, Shenzhen Grad Sch, Sch Comp Sci & Technol, Shenzhen, Guangdong, Peoples R China
[2] Harbin Inst Technol, Key Lab Network Oriented Intelligent Computat, Shenzhen Grad Sch, Shenzhen, Guangdong, Peoples R China
[3] Aston Univ, Sch Engn & Appl Sci, Birmingham B4 7ET, W Midlands, England
Funding
National Natural Science Foundation of China;
Keywords
REMOTE HOMOLOGY DETECTION; SEQUENCE-BASED PREDICTOR; AMINO-ACID-COMPOSITION; RNA-BINDING; WEB SERVER; IDENTIFICATION; GENOME; SITES; CLASSIFIER; PROTOCOL;
DOI
10.1186/1752-0509-9-S1-S10
Chinese Library Classification number
Q [Biological Sciences];
Subject classification code
07; 0710; 09;
Abstract
Background: DNA-binding proteins play a pivotal role in various intra- and extracellular activities, ranging from DNA replication to gene expression control. Identifying DNA-binding proteins is one of the major challenges in genome annotation. Several computational methods have been proposed in the literature for this task. However, most of them cannot provide a valuable knowledge base for understanding DNA-protein interactions.

Results: We first present a new protein sequence encoding method called PSSM Distance Transformation, and then construct a DNA-binding protein identification method (SVM-PSSM-DT) by combining PSSM Distance Transformation with a support vector machine (SVM). First, PSSM profiles are generated by using the PSI-BLAST program to search the non-redundant (NR) database. Next, the PSSM profiles are converted into uniform numeric representations by the distance transformation scheme. Lastly, the resulting representations are fed into an SVM classifier, which predicts whether a sequence can bind to DNA. In a benchmark test on 525 DNA-binding and 550 non-DNA-binding proteins using jackknife validation, the model achieved an ACC of 79.96%, an MCC of 0.622 and an AUC of 86.50%. This performance is considerably better than that of most existing state-of-the-art predictive methods. When tested on the recently constructed independent dataset PDB186, SVM-PSSM-DT achieved the best performance, with an ACC of 80.00%, an MCC of 0.647 and an AUC of 87.40%, outperforming several existing state-of-the-art methods.

Conclusions: The experimental results demonstrate that PSSM Distance Transformation is a viable protein sequence encoding method and that SVM-PSSM-DT is a useful tool for identifying DNA-binding proteins. A user-friendly web server for SVM-PSSM-DT is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/PSSM-DT/.
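The abstract does not give the transformation formula, so the following is only a minimal sketch of how a variable-length PSSM could be mapped to a fixed-length vector by a distance-based scheme. The function name `pssm_distance_transform`, the parameter `max_distance`, and the choice of averaging products of scores at positions separated by a gap `d` are all assumptions made for illustration, not the authors' published definition.

```python
from itertools import product


def pssm_distance_transform(pssm, max_distance):
    """Map an L x C score matrix to a fixed-length feature vector.

    Hypothetical scheme (an assumption, not taken from the paper):
    for each gap d in 1..max_distance and each ordered column pair (i, j),
    average pssm[k][i] * pssm[k + d][j] over all valid positions k.
    The output length is max_distance * C * C, independent of L.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if max_distance >= length:
        raise ValueError("max_distance must be smaller than the sequence length")

    features = []
    for d in range(1, max_distance + 1):
        n_pairs = length - d  # number of position pairs separated by d
        for i, j in product(range(n_cols), repeat=2):
            total = sum(pssm[k][i] * pssm[k + d][j] for k in range(n_pairs))
            features.append(total / n_pairs)
    return features


# Toy 4-residue "PSSM" with 3 columns instead of the usual 20.
toy = [[1, 0, 2], [0, 1, 0], [2, 1, 1], [1, 0, 0]]
vec = pssm_distance_transform(toy, max_distance=2)
print(len(vec))  # 2 gaps * 3 * 3 column pairs = 18
```

Because the vector length depends only on `max_distance` and the number of columns, sequences of different lengths yield inputs of the same size, which is what a standard SVM classifier requires.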
Pages: 12