A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data

Cited by: 86
Authors
Yao, ZZ [1]
Ruzzo, WL
Affiliations
[1] Univ Washington, Dept Comp Sci & Engn, Paul G Allen Ctr AC101, Seattle, WA 98195 USA
[2] Univ Washington, Dept Genome Sci, Seattle, WA 98195 USA
DOI
10.1186/1471-2105-7-S1-S11
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010; 081704
Abstract
Background: As a variety of functional genomic and proteomic techniques become available, there is an increasing need for functional analysis methodologies that integrate heterogeneous data sources.

Methods: In this paper, we address this issue by proposing a general framework for gene function prediction based on the k-nearest-neighbor (KNN) algorithm. The choice of KNN is motivated by its simplicity, its flexibility to incorporate different data types, and its adaptability to irregular feature spaces. A weakness of traditional KNN methods, especially when handling heterogeneous data, is that performance is subject to the often ad hoc choice of similarity metric. To address this weakness, we apply regression methods to infer a similarity metric as a weighted combination of a set of base similarity measures, which helps to locate the neighbors that are most likely to be in the same class as the target gene. We also suggest a novel voting scheme to generate confidence scores that estimate the accuracy of predictions. The method gracefully extends to multi-way classification problems.

Results: We apply this technique to gene function prediction according to three well-known Escherichia coli classification schemes suggested by biologists, using information derived from microarray and genome sequencing data. We demonstrate that our algorithm dramatically outperforms the naive KNN methods and is competitive with support vector machine (SVM) algorithms for integrating heterogeneous data. We also show that by combining different data sources, prediction accuracy can improve significantly.

Conclusion: Our extension of KNN with automatic feature weighting, multi-class prediction, and probabilistic inference enhances prediction accuracy significantly while remaining efficient, intuitive and flexible. This general framework can also be applied to similar classification problems involving heterogeneous datasets.
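The idea in the Methods paragraph (learning a similarity metric as a weighted combination of base similarities, then voting among the nearest neighbors) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the choice of logistic regression, the normalized-vote confidence score, the function names, and the toy numbers are all assumptions made here for demonstration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weights(pair_sims, same_class, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent (assumed regression form).

    pair_sims  -- one vector of base similarities per training gene pair
    same_class -- 1 if the pair shares a functional class, else 0
    Returns (weights, bias).
    """
    n, n_feat = len(pair_sims), len(pair_sims[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for s, y in zip(pair_sims, same_class):
            err = sigmoid(b + sum(wi * si for wi, si in zip(w, s))) - y
            for j in range(n_feat):
                grad_w[j] += err * s[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def combined_similarity(sims, w, b):
    """Learned similarity: estimated probability that a pair shares a class."""
    return sigmoid(b + sum(wi * si for wi, si in zip(w, sims)))

def knn_predict(sims_to_training, labels, w, b, k=3):
    """Vote among the k most similar training genes.

    The confidence of each class is its share of the summed learned
    similarities (a hypothetical scheme, chosen for simplicity).
    """
    scored = sorted(
        ((combined_similarity(s, w, b), lab)
         for s, lab in zip(sims_to_training, labels)),
        key=lambda t: t[0],
        reverse=True,
    )[:k]
    total = sum(p for p, _ in scored)
    votes = {}
    for p, lab in scored:
        votes[lab] = votes.get(lab, 0.0) + p / total
    return max(votes, key=votes.get), votes

# Toy data (invented): base similarity 0 is informative, base similarity 1 is not.
pair_sims = [[0.9, 0.2], [0.8, 0.7], [0.85, 0.5],
             [0.1, 0.6], [0.2, 0.3], [0.15, 0.8]]
same_class = [1, 1, 1, 0, 0, 0]
w, b = fit_weights(pair_sims, same_class)

# Base similarities between one target gene and four labeled training genes.
sims_to_training = [[0.9, 0.1], [0.8, 0.9], [0.2, 0.9], [0.1, 0.2]]
labels = ["transport", "transport", "metabolism", "metabolism"]
best, votes = knn_predict(sims_to_training, labels, w, b, k=3)
print(best, votes)
```

In this sketch the regression assigns a larger weight to the base similarity that better separates same-class pairs from different-class pairs, so the neighbor ranking no longer depends on a hand-picked metric. Because votes are accumulated per class label, the same code handles more than two classes without change.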
Pages: 11