Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure

Cited by: 74
Authors
Lewis, Darrin P.
Jebara, Tony
Noble, William Stafford [1]
Affiliations
[1] Univ Washington, Dept Comp Sci & Engn, Dept Genome Sci, Seattle, WA 98195 USA
[2] Columbia Univ, Dept Comp Sci, New York, NY 10027 USA
DOI
10.1093/bioinformatics/btl475
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Drawing inferences from large, heterogeneous sets of biological data requires a theoretical framework capable of representing, e.g., DNA and protein sequences, protein structures, microarray expression data and various types of interaction networks. Recently, a class of algorithms known as kernel methods has emerged as a powerful framework for combining diverse types of data. The support vector machine (SVM) algorithm is the most popular kernel method, owing to its theoretical underpinnings and strong empirical performance on a wide variety of classification tasks. Furthermore, several recently described extensions allow the SVM to assign relative weights to the various datasets, depending on their utility in performing a given classification task.

Results: In this work, we empirically investigate the performance of the SVM on the task of inferring gene functional annotations from a combination of protein sequence and structure data. Our results suggest that the SVM is quite robust to noise in the input datasets. Consequently, in the presence of only two types of data, an SVM trained from an unweighted combination of datasets performs as well as or better than a more sophisticated algorithm that assigns weights to individual data types. Indeed, for this simple case, we can demonstrate empirically that no solution is significantly better than the naive, unweighted average of the two datasets. On the other hand, when multiple noisy datasets are included in the experiment, the naive approach fares worse than the weighted approach. Our results suggest that for many applications, a naive unweighted sum of kernels may be sufficient.

Availability: http://noble.gs.washington.edu/proj/seqstruct

Contact: noble@gs.washington.edu

Supplementary information: Supplementary Data are available at Bioinformatics online.
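The difference between the naive and the weighted combination described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy feature vectors, the linear kernels, the cosine normalization, the helper names and the fixed weights 0.8 and 0.2 are hypothetical and are not taken from the paper, which learns its weights rather than fixing them by hand.

```python
import math

def linear_kernel(X):
    """Gram matrix of dot products between the rows of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def normalize_kernel(K):
    """Rescale K so every diagonal entry is 1 (assumes no all-zero rows)."""
    d = [math.sqrt(K[i][i]) for i in range(len(K))]
    n = len(K)
    return [[K[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]

def combine_kernels(kernels, weights=None):
    """Weighted sum of equally sized kernel matrices.

    With weights=None every kernel gets the same weight 1/len(kernels),
    which is the naive unweighted average.
    """
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical toy data: three proteins described by two feature views.
seq_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
struct_features = [[2.0, 0.0], [2.0, 0.0], [0.0, 3.0]]

K_seq = normalize_kernel(linear_kernel(seq_features))
K_struct = normalize_kernel(linear_kernel(struct_features))

# Naive approach: unweighted average of the two kernels.
K_naive = combine_kernels([K_seq, K_struct])

# Weighted approach: hand-picked weights standing in for learned ones.
K_weighted = combine_kernels([K_seq, K_struct], weights=[0.8, 0.2])
```

Because a non-negative weighted sum of valid kernels is itself a valid kernel, either matrix could be passed to any SVM implementation that accepts a precomputed kernel matrix.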
Pages: 2753-2760
Page count: 8