Fast kernels for inexact string matching

被引:17
作者
Leslie, C [1 ]
Kuang, R [1 ]
机构
[1] Columbia Univ, New York, NY 10027 USA
来源
LEARNING THEORY AND KERNEL MACHINES | 2003年 / 2777卷
关键词
kernel methods; string kernels; computational biology;
D O I
10.1007/978-3-540-45167-9_10
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We introduce several new families of string kernels designed in particular for use with support vector machines (SVMs) for classification of protein sequence data. These kernels - restricted gappy kernels, substitution kernels, and wildcard kernels - are based on feature spaces indexed by k-length subsequences from the string alphabet Sigma (or the alphabet augmented by a wildcard character), and hence they are related to the recently presented (k, m)-mismatch kernel and string kernels used in text classification. However, for all kernels we define here, the kernel value K(x, y) can be computed in O (c(K)(\x\ + \y\)) time, where the constant c(K) depends on the parameters of the kernel but is independent of the size \Sigma\ of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant c(K) = k(m+1)\Sigma\(m) of the mismatch kernel. We compute the kernels efficiently using a recursive function based on a trie data structure and relate our new kernels to the recently described transducer formalism. Finally, we report protein classification experiments on a benchmark SCOP dataset, where we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models.
引用
收藏
页码:114 / 128
页数:15
相关论文
共 15 条
[1]  
ALTSCHUL SF, 1990, J MOL BIOL, V215, P403, DOI 10.1006/jmbi.1990.9999
[2]  
CORTES C, 2002, NEURAL INFORMATION P
[3]  
Dayhoff M., 1978, ATLAS PROTEIN SEQ ST, V5, P353
[4]  
ESKIN E, 2003, UNIFIED APPROACH SEQ
[5]   Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching [J].
Gribskov, M ;
Robinson, NL .
COMPUTERS & CHEMISTRY, 1996, 20 (01) :25-33
[6]  
Haussler D, 1999, CONVOLUTION KERNELS
[7]   AMINO-ACID SUBSTITUTION MATRICES FROM PROTEIN BLOCKS [J].
HENIKOFF, S ;
HENIKOFF, JG .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1992, 89 (22) :10915-10919
[8]  
Jaakkola T, 1999, Proc Int Conf Intell Syst Mol Biol, P149
[9]  
Leslie C., 2002, P PAC BIOC S
[10]  
LESLIE C, 2002, NEURAL INFORMATION P, V15