An upper bound on the hardness of exact matrix based motif discovery

被引：2

作者：

Horton, Paul ^{[1
]}

Fujibuchi, Wataru ^{[1
]}

机构：

[1] AIST, Computat Biol Res Ctr, Tokyo, Japan

来源：

JOURNAL OF DISCRETE ALGORITHMS | 2007年 / 5卷 / 04期

关键词：

Motif discovery; Computational complexity; Combinatorics; Transcription factor binding site prediction; String algorithm;

D O I：

10.1016/j.jda.2006.10.006

中图分类号：

O29 [应用数学];

学科分类号：

070104 ;

摘要：

Motif discovery is the problem of finding local patterns or motifs from a set of unlabeled sequences. One common representation of a motif is a Markov model known as a score matrix. Matrix based motif discovery has been extensively studied but no positive results have been known regarding its theoretical hardness. We present the first non-trivial upper bound on the complexity (worst-case computation time) of this problem. Other than linear terms, our bound depends only on the motif width w (which is typically 5-20) and is a dramatic improvement relative to previously known bounds. We prove this bound by relating the motif discovery problem to a search problem over permutations of strings of length w, in which the permutations have a particular property. We give a constructive proof of an upper bound on the number of such permutations. For an alphabet size of sigma (typically 4) the trivial bound is n! approximate to (n/e)(n), n = sigma(w). Our bound is roughly n(sigma log(sigma) n)(n). We relate this theoretical result to the exact motif discovery program, TsukubaBB, whose algorithm contains ideas which inspired the result. We describe a recent improvement to the TsukubaBB program which can give a speed up of nine or more and use a dataset of REB1 transcription factor binding sites to illustrate that exact methods can indeed be used in some practical situations. (C) 2006 Published by Elsevier B.V.

引用

页码：706 / 713

页数：8

共 14 条

[1]

AKUTSU T, 2000, P 4 ANN INT C COMP M, P1

[2]

BAILEY TL, 1995, MACH LEARN, V21, P51, DOI 10.1007/BF00993379

[3]

Frith M., 2004, NUCLEIC ACIDS RES, P189

[4] Identifying DNA and protein patterns with statistically significant alignments of multiple sequences [J].