An upper bound on the hardness of exact matrix based motif discovery

被引:2
作者
Horton, Paul [1 ]
Fujibuchi, Wataru [1 ]
机构
[1] AIST, Computat Biol Res Ctr, Tokyo, Japan
关键词
Motif discovery; Computational complexity; Combinatorics; Transcription factor binding site prediction; String algorithm;
D O I
10.1016/j.jda.2006.10.006
中图分类号
O29 [应用数学];
学科分类号
070104 ;
摘要
Motif discovery is the problem of finding local patterns or motifs from a set of unlabeled sequences. One common representation of a motif is a Markov model known as a score matrix. Matrix based motif discovery has been extensively studied but no positive results have been known regarding its theoretical hardness. We present the first non-trivial upper bound on the complexity (worst-case computation time) of this problem. Other than linear terms, our bound depends only on the motif width w (which is typically 5-20) and is a dramatic improvement relative to previously known bounds. We prove this bound by relating the motif discovery problem to a search problem over permutations of strings of length w, in which the permutations have a particular property. We give a constructive proof of an upper bound on the number of such permutations. For an alphabet size of sigma (typically 4) the trivial bound is n! approximate to (n/e)(n), n = sigma(w). Our bound is roughly n(sigma log(sigma) n)(n). We relate this theoretical result to the exact motif discovery program, TsukubaBB, whose algorithm contains ideas which inspired the result. We describe a recent improvement to the TsukubaBB program which can give a speed up of nine or more and use a dataset of REB1 transcription factor binding sites to illustrate that exact methods can indeed be used in some practical situations. (C) 2006 Published by Elsevier B.V.
引用
收藏
页码:706 / 713
页数:8
相关论文
共 14 条
[1]  
AKUTSU T, 2000, P 4 ANN INT C COMP M, P1
[2]  
BAILEY TL, 1995, MACH LEARN, V21, P51, DOI 10.1007/BF00993379
[3]  
Frith M., 2004, NUCLEIC ACIDS RES, P189
[4]   Identifying DNA and protein patterns with statistically significant alignments of multiple sequences [J].
Hertz, GZ ;
Stormo, GD .
BIOINFORMATICS, 1999, 15 (7-8) :563-577
[5]  
HERTZ GZ, 1990, COMPUT APPL BIOSCI, V6, P81
[6]  
Horton P, 1996, Pac Symp Biocomput, P368
[7]  
Horton P., 2000, P 11 ANN S COMB PATT, P84
[8]  
Horton P., 2001, J COMPUT BIOL, V8, P249
[9]   AN EXPECTATION MAXIMIZATION (EM) ALGORITHM FOR THE IDENTIFICATION AND CHARACTERIZATION OF COMMON SITES IN UNALIGNED BIOPOLYMER SEQUENCES [J].
LAWRENCE, CE ;
REILLY, AA .
PROTEINS-STRUCTURE FUNCTION AND GENETICS, 1990, 7 (01) :41-51
[10]   DETECTING SUBTLE SEQUENCE SIGNALS - A GIBBS SAMPLING STRATEGY FOR MULTIPLE ALIGNMENT [J].
LAWRENCE, CE ;
ALTSCHUL, SF ;
BOGUSKI, MS ;
LIU, JS ;
NEUWALD, AF ;
WOOTTON, JC .
SCIENCE, 1993, 262 (5131) :208-214