How Compression and Approximation Affect Efficiency in String Distance Measures

被引:0
作者
Ganesh, Arun [1 ]
Kociurnaka, Tornasz [1 ]
Lincoln, Andrea [1 ]
Saha, Barna [1 ]
机构
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
来源
PROCEEDINGS OF THE 2022 ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, SODA | 2022年
关键词
EDIT DISTANCE; SEQUENCES; ALGORITHM;
D O I
暂无
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Real-world data often comes in compressed form. Analyzing compressed data directly (without first decompressing it) can save space and time by orders of magnitude. In this work, we focus on fundamental sequence comparison problems and try to quantify the gain in time complexity when the underlying data is highly compressible. We consider grammar compression, which unifies many practically relevant compression schemes such as the Lempel-Ziv family, dictionary methods, and others. For two strings of total length N and total compressed size n, it is known that the edit distance and a longest common subsequence (LCS) can be computed exactly in time (O) over tilde (nN), as opposed to O(N-2) for the uncompressed setting. Many real-world applications need to align multiple sequences simultaneously, and the fastest known exact algorithms for median edit distance and LCS of k strings run in O (N k) time, whereas the one for center edit distance has a time complexity of O(N-2k). This naturally raises the question if compression can help to reduce the running time significantly for k >= 3, perhaps to O (N-k/2 n(k/2)) or, more optimistically, to O (Nnk-1).(1) Unfortunately, we show new lower bounds that rule out any improvement beyond Omega(Nk-1 n) time for any of these problems assuming the Strong Exponential Time Hypothesis (SETH), where again N and n represent the total length and the total compressed size, respectively. This answers an open question of Abboud, Backurs, Bringmann, and Kunnemann (FOCS'17). In presence of such negative results, we ask if allowing approximation can help, and we show that approximation and compression together can be surprisingly effective for both multiple and two strings. We develop an (O) over tilde (N(k/2)n(k/2))-time FPTAS for the median edit distance of k sequences, leading to a saving of nearly half the dimensions for highly-compressible sequences. In comparison, no O(N-k)-time PTAS is known for the median edit distance problem in the uncompressed setting. We obtain an improvement from (O) over tilde (N-2k) to (O) over tilde (Nk/2+o(k) n(k/2)) for the center edit distance problem. For two strings, we get an (O) over tilde (N-2/3 n(4/3))-time FPTAS for both edit distance and LCS; note that this running time is o(N) whenever n << N-1/4. In contrast, for uncompressed strings, there is not even a subquadratic algorithm for LCS that has less than polynomial gap in the approximation factor. Building on the insight from our approximation algorithms, we also obtain several new and improved results for many fundamental distance measures including the edit, Hamming, and shift distances.
引用
收藏
页码:2867 / 2919
页数:53
相关论文
共 57 条
[1]   Fine-Grained Complexity of Analyzing Compressed Data: Quantifying Improvements over Decompress-And-Solve [J].
Abboud, Amir ;
Backurs, Arturs ;
Bringmann, Karl ;
Kuennemann, Marvin .
2017 IEEE 58TH ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS), 2017, :192-203
[2]   Tight Hardness Results for LCS and other Sequence Similarity Measures [J].
Abboud, Amir ;
Backurs, Arturs ;
Williams, Virginia Vassilevska .
2015 IEEE 56TH ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, 2015, :59-78
[3]   GEOMETRIC APPLICATIONS OF A MATRIX-SEARCHING ALGORITHM [J].
AGGARWAL, A ;
KLAWE, MM ;
MORAN, S ;
SHOR, P ;
WILBER, R .
ALGORITHMICA, 1987, 2 (02) :195-208
[4]   Edit Distance in Near-Linear Time: it's a Constant Factor [J].
Andoni, Alexandr ;
Nosatzki, Negev Shekel .
2020 IEEE 61ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS 2020), 2020, :990-1001
[5]  
Andoni A, 2013, STOC'13: PROCEEDINGS OF THE 2013 ACM SYMPOSIUM ON THEORY OF COMPUTING, P931
[6]  
Andoni A, 2013, PROCEEDINGS OF THE TWENTY-FOURTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS (SODA 2013), P457
[7]   APPROXIMATING EDIT DISTANCE IN NEAR-LINEAR TIME [J].
Andoni, Alexandr ;
Onak, Krzysztof .
SIAM JOURNAL ON COMPUTING, 2012, 41 (06) :1635-1648
[8]   Polylogarithmic Approximation for Edit Distance and the Asymmetric Query Complexity [J].
Andoni, Alexandr ;
Krauthgamer, Robert ;
Onak, Krzysztof .
2010 IEEE 51ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, 2010, :377-386
[9]  
Andoni A, 2008, PROCEEDINGS OF THE NINETEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P343
[10]  
[Anonymous], 2003, P 30 IFTH ANN ACM S, DOI 10.1145/780542.780590