Improving protein fold recognition by random forest

Cited by: 51

Authors:

Jo, Taeho [1]

Cheng, Jianlin [1]

Affiliations:

[1] Univ Missouri, C Bond Life Sci Ctr, Inst Informat, Dept Comp Sci, Columbia, MO 65211 USA

Source:

BMC BIOINFORMATICS | 2014 / Vol. 15

Funding:

U.S. National Institutes of Health (NIH);

Keywords:

PROFILE-PROFILE ALIGNMENT; HIDDEN MARKOV-MODELS; PSI-BLAST; SEQUENCE; PREDICTION; CLASSIFICATION; INFORMATION; EVOLUTIONARY; DATABASE; TOOL;

DOI:

10.1186/1471-2105-15-S11-S14

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Background: Recognizing the correct structural fold among known template protein structures for a target protein (i.e., fold recognition) is essential for template-based protein structure modeling. Because fold recognition can be framed as a binary classification problem, namely predicting whether the unknown fold of a target protein is similar to a known template structure in a library, machine learning methods have been applied to it effectively. In this work, we developed RF-Fold, which uses random forest, one of the most powerful and scalable machine learning classification methods, to recognize protein folds. Results: RF-Fold consists of hundreds of decision trees that can be trained efficiently on very large datasets and make accurate predictions on a highly imbalanced dataset. We evaluated RF-Fold by cross-validation on the standard Lindahl benchmark dataset, which comprises 976 × 975 target-template protein pairs. Compared with 17 different fold recognition methods, RF-Fold generally performs comparably to the best method at every difficulty level, from the easiest (family) through the medium-hard (superfamily) to the hardest (fold). Based on the top-one template protein ranked by RF-Fold, the correct recognition rate is 84.5%, 63.4%, and 40.8% at the family, superfamily, and fold levels, respectively. Based on the top-five template proteins ranked by RF-Fold, the correct recognition rate increases to 91.5%, 79.3%, and 58.3% at the family, superfamily, and fold levels, respectively. Conclusions: The good performance achieved by RF-Fold demonstrates the effectiveness of random forest for protein fold recognition.
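
The top-one and top-five recognition rates quoted in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `top_k_recognition_rate`, the data layout, and the toy scores are all hypothetical, and the scores merely stand in for whatever per-pair confidence a classifier such as a random forest would output. The sketch assumes a target counts as recognized when at least one correct template appears among its k highest-scoring templates.

```python
from typing import Dict, List, Set, Tuple


def top_k_recognition_rate(
    scores: Dict[str, List[Tuple[str, float]]],
    correct: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of targets with a correct template among their k best-scored templates.

    scores  : target id -> list of (template id, classifier score) pairs (hypothetical layout)
    correct : target id -> set of template ids regarded as true matches at the chosen level
    """
    if not scores:
        return 0.0
    hits = 0
    for target, scored_templates in scores.items():
        # Rank this target's templates from highest to lowest score.
        ranked = sorted(scored_templates, key=lambda pair: pair[1], reverse=True)
        top_templates = {template for template, _ in ranked[:k]}
        # The target is a hit if any of its top-k templates is a true match.
        if top_templates & correct.get(target, set()):
            hits += 1
    return hits / len(scores)


# Made-up toy data: three targets, each scored against three templates.
toy_scores = {
    "t1": [("a", 0.9), ("b", 0.4), ("c", 0.1)],
    "t2": [("a", 0.2), ("b", 0.7), ("c", 0.6)],
    "t3": [("a", 0.5), ("b", 0.3), ("c", 0.8)],
}
toy_correct = {"t1": {"a"}, "t2": {"c"}, "t3": {"b"}}

top1 = top_k_recognition_rate(toy_scores, toy_correct, k=1)
top2 = top_k_recognition_rate(toy_scores, toy_correct, k=2)
```

In this toy case only `t1` has its correct template ranked first, while `t2` recovers its correct template once the second-ranked candidate is allowed. This is the same effect that raises the reported rates when moving from top-one to top-five.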

Pages: 7
