Using Amino Acid Physicochemical Distance Transformation for Fast Protein Remote Homology Detection

Cited by: 91
Authors
Liu, Bin [1 ,2 ]
Wang, Xiaolong [1 ,2 ]
Chen, Qingcai [1 ,2 ]
Dong, Qiwen [3 ]
Lan, Xun [4 ]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci & Technol, Shenzhen Grad Sch, Shenzhen, Guangdong, Peoples R China
[2] Harbin Inst Technol, Key Lab Network Oriented Intelligent Computat, Shenzhen Grad Sch, Shenzhen, Guangdong, Peoples R China
[3] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
[4] Ohio State Univ, Dept Biomed Informat, Columbus, OH 43210 USA
Funding
National Natural Science Foundation of China;
Keywords
SUPPORT VECTOR MACHINES; DISCRIMINATIVE METHOD; FOLD RECOGNITION; STRUCTURAL CLASS; STRING KERNELS; PREDICTION; SERVER; SIMILARITY; ACCURATE; DATABASE;
DOI
10.1371/journal.pone.0046633
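Chinese Library Classification number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject classification code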
07 ; 0710 ; 09 ;
Abstract
Protein remote homology detection is one of the most important problems in bioinformatics. Discriminative methods such as support vector machines (SVMs) have shown superior performance, but the performance of SVM-based methods depends on the vector representation of the protein sequences. Prior work has demonstrated that sequence-order effects are relevant for discrimination, yet little work has explored how to incorporate sequence-order information together with amino acid physicochemical properties into the prediction. To incorporate sequence-order effects into protein remote homology detection, the physicochemical distance transformation (PDT) method is proposed. Each protein sequence is first converted into a series of numbers using the physicochemical property scores in the amino acid index (AAIndex), and this numeric series is then converted into a fixed-length vector by PDT. In this way, sequence-order information is included in the feature vector at little computational cost. Finally, the feature vectors are input into a support vector machine classifier to detect protein remote homologies. Experiments on a well-known benchmark show that the proposed method, SVM-PDT, achieves performance superior or comparable to that of current state-of-the-art methods, at a considerably lower computational cost. When evolutionary information extracted from frequency profiles is combined with the PDT method, the profile-based PDT approach improves performance by 3.4% and 11.4% in terms of ROC score and ROC50 score, respectively. The proposed PDT efficiently captures the local sequence-order information of the protein and incorporates the physicochemical properties extracted from the amino acid index into the prediction. The physicochemical distance transformation provides a general framework that could be a valuable tool for protein-level studies.
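The two-step conversion described in the abstract (residues to property scores, then scores to a fixed-length vector) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not give the PDT formula, so the code assumes that each feature is the mean squared difference between the property values of residues `lag` positions apart. The function names `standardize` and `pdt_features`, the parameter `max_lag`, and the use of the Kyte-Doolittle hydropathy scale as the single example property are all assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values, used here as one example
# physicochemical property (the choice of scale is an assumption).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(scale):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    values = list(scale.values())
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {aa: (v - mean) / std for aa, v in scale.items()}


def pdt_features(sequence, scales, max_lag):
    """Map a protein sequence to a vector of length len(scales) * max_lag.

    Assumed form of the transformation: for each property and each lag,
    average the squared difference between the property values of
    residues that are `lag` positions apart.
    """
    if len(sequence) <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    features = []
    for scale in scales:
        # Step 1: residues -> numeric series for this property.
        values = [scale[aa] for aa in sequence]
        # Step 2: numeric series -> one feature per lag.
        for lag in range(1, max_lag + 1):
            n = len(values) - lag
            total = sum((values[j] - values[j + lag]) ** 2 for j in range(n))
            features.append(total / n)
    return features


if __name__ == "__main__":
    scales = [standardize(HYDROPATHY)]
    vector = pdt_features("MKVLAAGIVGLLLAQ", scales, max_lag=3)
    print(len(vector), vector)
```

Under this assumption the vector length depends only on the number of properties and the maximum lag, not on the sequence length, so sequences of different lengths map to vectors that a standard SVM implementation can consume directly.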
Pages: 10