The Dyck Language Edit Distance Problem in Near-linear Time

被引:21
作者
Saha, Barna [1 ]
机构
[1] Univ Massachusetts, Dept Comp Sci, Amherst, MA 01003 USA
来源
2014 55TH ANNUAL IEEE SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS 2014) | 2014年
关键词
edit distance; formal language; linear time algorithm design; approximation algorithms;
D O I
10.1109/FOCS.2014.71
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Given a string sigma over alphabet Sigma and a grammar G defined over the same alphabet, how many minimum number of repairs (insertions, deletions and substitutions) are required to map sigma into a valid member of G? The seminal work of Aho and Peterson in 1972 initiated the study of this language edit distance problem providing a dynamic programming algorithm for context free languages that runs in O(vertical bar G vertical bar(2)n(3)) time, where n is the string length and vertical bar G vertical bar is the grammar size. While later improvements reduced the running time to O(vertical bar G vertical bar(2)n(3)), the cubic time complexity on the input length held a major bottleneck for applying these algorithms to their multitude of applications. In this paper, we study the language edit distance problem for a fundamental context free language, DYCK(s) representing the language of well-balanced parentheses of s different types, that has been pivotal in the development of formal language theory. We provide the very first near-linear time algorithm to tightly approximate the DYCK(s) language edit distance problem for any arbitrary s. DYCK(s) language edit distance significantly generalizes the well-studied string edit distance problem, and appears in most applications of language edit distance ranging from data quality in databases, generating automated error-correcting parsers in compiler optimization to structure prediction problems in biological sequences. Its nondeterministic counterpart is known as the hardest context free language. Our main result is an algorithm for edit distance computation to DYCK(s) for any positive integer s that runs in O(n(1+epsilon)polylog(n)) time and achieves an approximation factor of O(1/epsilon beta(n) log vertical bar OPT vertical bar), for any epsilon > 0. Here OPT is the optimal edit distance to DYCK(s) and beta(n) is the best approximation factor known for the simpler problem of string edit distance running in analogous time. If we allow O(n(1+epsilon)vertical bar vertical bar OPT vertical bar(2)n(epsilon)) time, then the approximation factor can be reduced to O(1/epsilon log vertical bar OPT vertical bar). Since the best known near-linear time algorithm for the string edit distance problem has beta(n) = polylog(n), under near-linear time computation model both DYCK(s) language and string edit distance problems have polylog(n) approximation factors. This comes as a surprise since the former is a significant generalization of the latter and their exact computations via dynamic programming show a stark difference in time complexity. Rather less surprisingly, we show that the framework for efficiently approximating edit distance to DYCK(s) can be utilized for many other languages. We illustrate this by considering various memory checking languages (studied extensively under distributed verification) such as STACK, QUEUE, PQ and DEQUE which comprise of valid transcripts of stacks, queues, priority queues and double-ended queues respectively. Therefore, any language that can be recognized by these data structures, can also be repaired efficiently by our algorithm.
引用
收藏
页码:611 / 620
页数:10
相关论文
共 32 条
[1]  
Aho Alfred V., 1972, SIAM J COMPUT, V1
[2]  
Ajtai M., 2002, STOC
[3]   Regular languages are testable with a constant number of queries [J].
Alon, N ;
Krivelevich, M ;
Newman, I ;
Szegedy, M .
SIAM JOURNAL ON COMPUTING, 2001, 30 (06) :1842-1862
[4]  
Andoni A., 2010, FOCS
[5]  
Andoni A, 2009, ACM S THEORY COMPUT, P199
[6]  
Bar-Yossef Ziv, 2004, FOCS
[7]  
Batu Tugkan, 2006, SODA
[8]  
Blum M., 1991, FOCS
[9]  
Chakrabarti Amit, 2010, FOCS
[10]  
Chomsky N, 1963, ALGEBRAIC THEORY CON