Efficient Algorithms for Substring Near Neighbor Problem

被引:17
作者
Andoni, Alexandr
Indyk, Piotr
机构
来源
PROCEEDINGS OF THE SEVENTHEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS | 2006年
关键词
D O I
10.1145/1109557.1109690
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
In this paper we consider the problem of finding the approximate nearest neighbor when the data set points are the substrings of a given text T. Specifically, for a string T of length n, we present a data structure which does the following: given a pattern P, if there is a substring of T within the distance R from P, it reports a (possibly different) substring of T within distance cR from P. The length of the pattern P, denoted by m, is not known in advance. For the case where the distances are measured using the Hamming distance, we present a data structure which uses (O) over tilde (n(1+1/c)) space(1) and with (O) over tilde (n(l/c) + mn(o(1))) query time. This essentially matches the earlier bounds of [Ind98], which assumed that the pattern length m is fixed in advance. In addition, our data structure can be constructed in time (O) over tilde (n(1+1/c) + n(1+o(1)) M-1/3), where M is an upper bound for in. This essentially matches the preprocessing bound of [Ind98] as long as the term <(O)over tilde> (n(1+1/c)) dominates the running time, which is the case when, e.g., c < 3. We also extend our results to the case where the distances are measured according to the l(I) distance. The query time and the space bound are essentially the same, while the preprocessing time becomes <(O)over tilde>(n(1+1/c) + n(1+o(1)) M-2/3).
引用
收藏
页码:1203 / 1212
页数:10
相关论文
共 22 条
  • [1] EFFICIENT 2-DIMENSIONAL APPROXIMATE MATCHING OF HALF-RECTANGULAR FIGURES
    AMIR, A
    FARACH, M
    [J]. INFORMATION AND COMPUTATION, 1995, 118 (01) : 1 - 11
  • [2] ANDONI A, 2005, THESIS MIT CAMBRIDGE
  • [3] [Anonymous], P ANN INT C COMP MOL
  • [4] [Anonymous], P ANN INT C COMP MOL
  • [5] [Anonymous], P S THEOR COMP
  • [6] ARYA S, 1994, PROCEEDINGS OF THE FIFTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P573
  • [7] The LCA problem revisited
    Bender, MA
    Farach-Colton, M
    [J]. LATIN 2000: THEORETICAL INFORMATICS, 2000, 1776 : 88 - 94
  • [8] Efficient large-scale sequence comparison by locality-sensitive hashing
    Buhler, J
    [J]. BIOINFORMATICS, 2001, 17 (05) : 419 - 428
  • [9] Cole Richard, 2004, P 36 ANN ACM S THEOR, P91, DOI DOI 10.1145/1007352.1007374
  • [10] Datar M., 2004, P ACM S COMP GEOM