Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation

Cited by: 72

Authors:

Xu, Ruifeng [1,2]

Zhou, Jiyun [1]

Wang, Hongpeng [1]

He, Yulan [3]

Wang, Xiaolong [1,2]

Liu, Bin [1,2]

Affiliations:

[1] Harbin Inst Technol, Shenzhen Grad Sch, Sch Comp Sci & Technol, Shenzhen, Guangdong, Peoples R China

[2] Harbin Inst Technol, Key Lab Network Oriented Intelligent Computat, Shenzhen Grad Sch, Shenzhen, Guangdong, Peoples R China

[3] Aston Univ, Sch Engn & Appl Sci, Birmingham B4 7ET, W Midlands, England

Source:

BMC SYSTEMS BIOLOGY | 2015 / Volume 9

Funding:

National Natural Science Foundation of China;

Keywords:

REMOTE HOMOLOGY DETECTION; SEQUENCE-BASED PREDICTOR; AMINO-ACID-COMPOSITION; RNA-BINDING; WEB SERVER; IDENTIFICATION; GENOME; SITES; CLASSIFIER; PROTOCOL;

DOI:

10.1186/1752-0509-9-S1-S10

Chinese Library Classification (CLC) number:

Q [Biological Sciences];

Discipline classification code:

07 ; 0710 ; 09 ;

Abstract:

Background: DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Identification of DNA-binding proteins is one of the major challenges in the field of genome annotation. Several computational methods have been proposed in the literature for DNA-binding protein identification. However, most of them cannot provide an invaluable knowledge base for our understanding of DNA-protein interactions.

Results: We first present a new protein sequence encoding method called PSSM Distance Transformation, and then construct a DNA-binding protein identification method (SVM-PSSM-DT) by combining PSSM Distance Transformation with a support vector machine (SVM). First, the PSSM profiles are generated by using the PSI-BLAST program to search the non-redundant (NR) database. Next, the PSSM profiles are transformed into uniform numeric representations by a distance transformation scheme. Lastly, the resulting uniform numeric representations are fed into an SVM classifier, which predicts whether a sequence can bind to DNA. In a benchmark test on 525 DNA-binding and 550 non-DNA-binding proteins using jackknife validation, the present model achieved an ACC of 79.96%, an MCC of 0.622 and an AUC of 86.50%. This performance is considerably better than that of most existing state-of-the-art predictive methods. When tested on a recently constructed independent dataset, PDB186, SVM-PSSM-DT also achieved the best performance, with an ACC of 80.00%, an MCC of 0.647 and an AUC of 87.40%, and outperformed some existing state-of-the-art methods.

Conclusions: The experimental results demonstrate that PSSM Distance Transformation is a viable protein sequence encoding method and that SVM-PSSM-DT is a useful tool for identifying DNA-binding proteins. A user-friendly web server for SVM-PSSM-DT was constructed and is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/PSSM-DT/.
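The abstract states that variable-length PSSM profiles are turned into uniform numeric representations by a distance transformation, but it does not give the formula. The following is a minimal sketch under a stated assumption: each feature is the average product of the scores in two PSSM columns at sequence positions separated by a distance d. The function name `pssm_distance_transform` and the parameter `max_distance` are hypothetical, and the toy matrices stand in for real PSI-BLAST output.

```python
import random


def pssm_distance_transform(pssm, max_distance):
    """Map an L x C score matrix to a vector of length max_distance * C * C.

    ASSUMPTION: each feature is the mean over positions j of
    pssm[j][a] * pssm[j + d][b]; the abstract does not confirm this formula.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if not 1 <= max_distance < length:
        raise ValueError("max_distance must be in [1, sequence length - 1]")
    features = []
    for d in range(1, max_distance + 1):
        span = length - d  # number of position pairs separated by d
        for a in range(n_cols):
            for b in range(n_cols):
                total = sum(pssm[j][a] * pssm[j + d][b] for j in range(span))
                features.append(total / span)
    return features


# Toy stand-ins for PSSM profiles of two proteins with different lengths.
rng = random.Random(0)
short_pssm = [[rng.randint(-5, 5) for _ in range(20)] for _ in range(5)]
long_pssm = [[rng.randint(-5, 5) for _ in range(20)] for _ in range(8)]

short_vec = pssm_distance_transform(short_pssm, max_distance=2)
long_vec = pssm_distance_transform(long_pssm, max_distance=2)
print(len(short_vec), len(long_vec))  # both vectors have 2 * 20 * 20 entries
```

Because the output length depends only on `max_distance` and the number of columns, proteins of any length yield vectors of the same size, which is what a fixed-input classifier such as an SVM requires.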

Pages: 12
