Gene function classification using Bayesian models with hierarchy-based priors

Cited by: 14
Authors
Shahbaba, Babak [1]
Neal, Radford M.
Affiliations
[1] Univ Toronto, Dept Publ Hlth Sci, Toronto, ON, Canada
[2] Univ Toronto, Dept Stat, Toronto, ON, Canada
[3] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada
DOI
10.1186/1471-2105-7-448
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Background: We investigate whether annotation of gene function can be improved using a classification scheme that takes into account the hierarchical organization of functional classes. The classifiers use phylogenetic descriptors, sequence-based attributes, and predicted secondary structure. We discuss three Bayesian models and compare their predictive accuracy: the ordinary multinomial logit (MNL) model, a hierarchical model based on a set of nested MNL models, and an MNL model with a prior that introduces correlations between the parameters for classes that are nearby in the hierarchy. We also provide a new scheme for combining different sources of information. We use these models to predict the functional class of Open Reading Frames (ORFs) from the E. coli genome.
Results: All three models show substantial improvement over previous methods, which were based on the C5 decision tree algorithm. The MNL model using a prior based on the hierarchy outperforms both the non-hierarchical MNL model and the nested MNL model. In contrast to previous attempts at combining the three sources of information in this dataset, our new approach to combining data sources produces a higher accuracy rate than applying our models to each data source alone.
Conclusion: Together, these results show that gene function can be predicted with higher accuracy than previously achieved, using Bayesian models that incorporate suitable prior information.
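The abstract does not spell out how the hierarchy-based prior is built. As a minimal sketch of one way such a prior could induce correlations, assume each class coefficient is the sum of independent zero-mean Gaussian terms, one per branch on the path from the root to that class. Classes that share more branches then have more strongly correlated coefficients. The toy hierarchy, the class names, and the functions `prior_covariance` and `mnl_probabilities` below are illustrative assumptions, not taken from the paper.

```python
import math

# Hypothetical toy hierarchy: each leaf class is listed with the branches
# on its path from the root. Names are illustrative, not from the paper.
PATHS = {
    "A1": ["A", "A1"],
    "A2": ["A", "A2"],
    "B1": ["B", "B1"],
}


def prior_covariance(class_a, class_b, branch_var=1.0):
    """Prior covariance between one coefficient of two classes.

    Assumption: each class coefficient is a sum of independent
    N(0, branch_var) terms, one per branch on the class's path, so the
    covariance is branch_var times the number of shared branches.
    """
    shared = set(PATHS[class_a]) & set(PATHS[class_b])
    return branch_var * len(shared)


def mnl_probabilities(scores):
    """Multinomial logit (softmax) over per-class linear scores."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}
```

Under these assumptions, `prior_covariance("A1", "A2")` is positive because both classes share branch `A`, while `prior_covariance("A1", "B1")` is zero because their paths have no branch in common. An ordinary MNL model with independent priors would correspond to zero covariance for every pair of distinct classes.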
Pages: 9