Improving protein fold recognition by random forest

Cited by: 51
Authors
Jo, Taeho [1 ]
Cheng, Jianlin [1 ]
Affiliations
[1] Univ Missouri, C Bond Life Sci Ctr, Inst Informat, Dept Comp Sci, Columbia, MO 65211 USA
Source
BMC BIOINFORMATICS | 2014 / Vol. 15
Funding
National Institutes of Health (NIH), USA;
Keywords
PROFILE-PROFILE ALIGNMENT; HIDDEN MARKOV-MODELS; PSI-BLAST; SEQUENCE; PREDICTION; CLASSIFICATION; INFORMATION; EVOLUTIONARY; DATABASE; TOOL;
DOI
10.1186/1471-2105-15-S11-S14
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Recognizing the correct structural fold among known template protein structures for a target protein (i.e., fold recognition) is essential for template-based protein structure modeling. Because fold recognition can be framed as a binary classification problem, namely predicting whether the unknown fold of a target protein is similar to a known template protein structure in a library, machine learning methods have been applied to it effectively. In this work, we developed RF-Fold, which uses random forest - one of the most powerful and scalable machine learning classification methods - to recognize protein folds. Results: RF-Fold consists of hundreds of decision trees that can be trained efficiently on very large datasets to make accurate predictions on a highly imbalanced dataset. We evaluated RF-Fold by cross-validation on the standard Lindahl benchmark dataset, which comprises 976 x 975 target-template protein pairs. Compared with 17 different fold recognition methods, the performance of RF-Fold is generally comparable to the best performance at every difficulty level, from the easiest (family) through the medium-hard (superfamily) to the hardest (fold). Based on the top-one template ranked by RF-Fold, the correct recognition rate is 84.5%, 63.4%, and 40.8% at the family, superfamily, and fold levels, respectively. Based on the top-five templates ranked by RF-Fold, the correct recognition rate increases to 91.5%, 79.3%, and 58.3% at the family, superfamily, and fold levels, respectively. Conclusions: The good performance of RF-Fold demonstrates the effectiveness of random forest for protein fold recognition.
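The top-one and top-five recognition rates quoted in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes that each target-template pair already has a score (for example, a classifier's confidence that the pair shares a fold) and a correctness label, and that every target has at least one correct template. The function name `top_k_recognition_rate`, the data layout, and the toy scores are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: target id -> list of (score, is_correct), one per template.
# `score` stands in for any pairwise classifier output; it is not RF-Fold's output.
Ranked = Dict[str, List[Tuple[float, bool]]]


def top_k_recognition_rate(pairs_by_target: Ranked, k: int) -> float:
    """Fraction of targets with a correct template among their k best-scored ones."""
    if not pairs_by_target:
        return 0.0
    hits = 0
    for pairs in pairs_by_target.values():
        # Rank this target's templates from highest to lowest score, keep the top k.
        best = sorted(pairs, key=lambda p: p[0], reverse=True)[:k]
        if any(is_correct for _, is_correct in best):
            hits += 1
    return hits / len(pairs_by_target)


# Toy data (invented for illustration only): three targets, three templates each.
toy: Ranked = {
    "t1": [(0.9, True), (0.4, False), (0.1, False)],
    "t2": [(0.8, False), (0.7, True), (0.2, False)],
    "t3": [(0.6, False), (0.5, False), (0.3, True)],
}

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, top_k_recognition_rate(toy, k))
```

In this toy data only `t1` has its correct template ranked first, so the top-one rate is one third; widening the cutoff to two or three templates recovers `t2` and then `t3`, which mirrors why the top-five rates in the abstract are higher than the top-one rates.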
Pages: 7