The PFP and ESG protein function prediction methods in 2014: effect of database updates and ensemble approaches

Cited by: 13
Authors
Khan, Ishita K. [1]
Wei, Qing [1]
Chapman, Samuel [3]
Kc, Dukka B. [3]
Kihara, Daisuke [1, 2]
Affiliations
[1] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
[2] Purdue Univ, Dept Biol Sci, W Lafayette, IN 47907 USA
[3] North Carolina A&T State Univ, Dept Computat Sci & Engn, Greensboro, NC 27411 USA
Funding
National Research Foundation of Singapore; National Science Foundation (USA); National Institutes of Health (USA)
Keywords
Protein function; sequence; CAFA; function prediction; PFP; ESG; consensus method; ensemble method; gene annotation; BINDING LIGAND PREDICTION; GENE-EXPRESSION DATA; ESCHERICHIA-COLI; SEQUENCE; ONTOLOGY; ANNOTATIONS; INFERENCE; SYSTEM; FAMILY; TOOLS;
DOI
10.1186/s13742-015-0083-4
Chinese Library Classification (CLC)
Q [Biological Sciences]
Subject classification codes
07 ; 0710 ; 09 ;
Abstract
Background: Functional annotation of novel proteins is one of the central problems in bioinformatics. As genome sequencing technologies continue to advance, more and more sequence information is becoming available to analyze and annotate. To achieve fast and automatic function annotation, many computational (automated) function prediction (AFP) methods have been developed. To objectively evaluate the performance of such methods on a large scale, community-wide assessment experiments have been conducted. The second round of the Critical Assessment of Function Annotation (CAFA) experiment was held in 2013-2014. Evaluation of participating groups was reported in a special interest group meeting at the Intelligent Systems for Molecular Biology (ISMB) conference in Boston in 2014. Our group participated in both CAFA1 and CAFA2 using multiple in-house AFP methods. Here, we report benchmark results of our methods obtained while preparing for CAFA2, before submitting function predictions for the CAFA2 targets.
Results: For CAFA2, we updated the annotation databases used by our methods, protein function prediction (PFP) and extended similarity group (ESG), and benchmarked their function prediction performance using the original (older) and updated databases. Performance evaluations of PFP under different settings and of ESG are discussed. We also developed two ensemble methods that combine function predictions from six independent, sequence-based AFP methods. We further analyzed the performance of our prediction methods when the predictions are enriched with the prior distribution of gene ontology (GO) terms. Examples of predictions by the ensemble methods are discussed.
Conclusions: Updating the annotation database was successful, improving the Fmax prediction accuracy score for both PFP and ESG. Adding the prior distribution of GO terms yielded little improvement. Both of the ensemble methods we developed improved the average Fmax score over all individual component methods except for ESG. Our benchmark results will not only complement the overall assessment that will be done by the CAFA organizers, but also help elucidate the predictive power of sequence-based function prediction methods in general.
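The abstract reports accuracy as Fmax without defining it. The sketch below follows the protein-centric Fmax commonly used in CAFA evaluations: at each score threshold, precision is averaged over the proteins that have at least one predicted term at that threshold, recall is averaged over all benchmark proteins, and Fmax is the largest resulting F-measure. This is a minimal illustration under stated assumptions, not the authors' evaluation code. The function name `fmax`, the dictionary layout, the threshold grid, and the toy GO identifiers are all hypothetical, and propagation of terms to their GO ancestors is omitted.

```python
from typing import Dict, Set


def fmax(predictions: Dict[str, Dict[str, float]],
         truth: Dict[str, Set[str]],
         steps: int = 100) -> float:
    """Protein-centric Fmax over thresholds 1/steps, 2/steps, ..., 1.0.

    predictions: protein -> {GO term -> confidence score in [0, 1]}
    truth:       protein -> non-empty set of true GO terms
    Assumption: terms are compared as given; no ancestor propagation.
    """
    best = 0.0
    for i in range(1, steps + 1):
        t = i / steps
        precisions = []      # one entry per protein with >= 1 prediction at t
        recall_sum = 0.0     # accumulated over every benchmark protein
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue         # no protein has a prediction at this threshold
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(truth)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy data with made-up term identifiers, for illustration only.
example_predictions = {
    "P1": {"GO:A": 0.9, "GO:B": 0.4},
    "P2": {"GO:C": 0.8},
}
example_truth = {
    "P1": {"GO:A"},
    "P2": {"GO:C", "GO:D"},
}
print(round(fmax(example_predictions, example_truth), 3))  # 0.857
```

In the toy data, the best trade-off occurs at thresholds between 0.4 and 0.8, where precision is 1.0 and recall is 0.75, giving F = 6/7. Averaging precision only over proteins with predictions means a method is not penalized on precision for abstaining, although abstaining still lowers recall.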
Pages: 14