Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space

被引:9
作者
Kempa, Dominik [1 ]
Kociumaka, Tomasz [2 ]
机构
[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
[2] Max Planck Inst Informat, Saarland Informat Campus, Saarbrucken, Germany
来源
2023 IEEE 64TH ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, FOCS | 2023年
关键词
data compression; text indexing; compressed indexing; suffix array;
D O I
10.1109/FOCS57990.2023.00114
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
The last two decades have witnessed a dramatic increase in the amount of highly repetitive datasets consisting of sequential data (strings, texts). Processing these massive amounts of data using conventional data structures is infeasible. This fueled the development of compressed text indexes, which efficiently answer various queries on a given text, typically in polylogarithmic time, while occupying space proportional to the compressed representation of the text. There exist numerous structures supporting queries ranging from simple "local" queries, such as random access, through more complex ones, including longest common extension (LCE) queries, to the most powerful queries, such as the suffix array (SA) functionality. Alongside the rich repertoire of queries followed a detailed study of the trade-off between the size and functionality of compressed indexes (see: Navarro; ACM Comput. Surv. 2021). It is widely accepted that this hierarchy of structures tells a simple story: the more powerful the queries, the more space is needed. On the one hand, random access, the most basic query, can be supported using O(delta log n log sigma/delta log n) space (where n is the length of the text, sigma is the alphabet size, and delta is the text's substring complexity), which is known to be the asymptotically smallest space sufficient to represent any string with parameters n, sigma, and delta (Kociumaka, Navarro, and Prezza; IEEE Trans. Inf. Theory 2023). The other end of the hierarchy is occupied by indexes supporting the suffix array queries. The currently smallest one takes O(r log n/r) space, where r >= delta is the number of runs in the Burrows-Wheeler Transform of the text (Gagie, Navarro, and Prezza; J. ACM 2020). We present a new compressed index, referred to as delta-SA, that supports the powerful SA functionality and needs only O(delta log n log sigma/delta log n) space. This collapses the hierarchy of compressed data structures into a single point: The space required to represent the text is simultaneously sufficient to efficiently support the full SA functionality. Since suffix array queries are the most widely utilized queries in string processing and data compression, our result immediately improves the space complexity of dozens of algorithms, which can now be executed in delta-optimal compressed space. The delta-SA supports both suffix array and inverse suffix array queries in O(log(4+epsilon) n) time (where epsilon > 0 is any predefined constant). Our second main result is an O(delta polylog n)-time construction of the delta-SA from the Lempel-Ziv (LZ77) parsing of the text. This is the first algorithm that builds an SA index in compressed time, i.e., time nearly linear in the compressed input size. For highly repetitive texts, this is up to exponentially faster than the previously best algorithm, which builds an O(r log n/r)-size index in O(root delta n poly log n) time. To obtain our results, we develop numerous new techniques of independent interest. This includes deterministic restricted recompression, delta-compressed string synchronizing sets, and their construction in compressed time. We also improve many other auxiliary data structures; e.g., we show the first O(delta log n log sigma/delta log n)size index for LCE queries along with its efficient construction from the LZ77 parsing.
引用
收藏
页码:1877 / 1886
页数:10
相关论文
共 77 条
  • [1] Abboud A., 2020, 34 C NEUR INF PROC S, V33, P8810
  • [2] Fine-Grained Complexity of Analyzing Compressed Data: Quantifying Improvements over Decompress-And-Solve
    Abboud, Amir
    Backurs, Arturs
    Bringmann, Karl
    Kuennemann, Marvin
    [J]. 2017 IEEE 58TH ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS), 2017, : 192 - 203
  • [3] Adjeroh Donald, 2008, The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching, DOI DOI 10.1007/978-0-387-78909-5
  • [4] Akmal S, 2022, PROCEEDINGS OF THE 2022 ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, SODA, P2791
  • [5] Block trees
    Belazzougui, Djamal
    Caceres, Manuel
    Gagie, Travis
    Gawrychowski, Pawel
    Kaerkkaeinen, Juha
    Navarro, Gonzalo
    Ordonez, Alberto
    Puglisi, Simon J.
    Tabei, Yasuo
    [J]. JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 2021, 117 : 1 - 22
  • [6] Linear time construction of compressed text indices in compact space
    Belazzougui, Djamal
    [J]. STOC'14: PROCEEDINGS OF THE 46TH ANNUAL 2014 ACM SYMPOSIUM ON THEORY OF COMPUTING, 2014, : 184 - 193
  • [7] Alphabet-Independent Compressed Text Indexing
    Belazzougui, Djamal
    Navarro, Gonzalo
    [J]. ACM TRANSACTIONS ON ALGORITHMS, 2014, 10 (04)
  • [8] Computational Biology in the 21st Century: Scaling with Compressive Algorithms
    Berger, Bonnie
    Daniels, Noah M.
    Yu, Y. William
    [J]. COMMUNICATIONS OF THE ACM, 2016, 59 (08) : 72 - 80
  • [9] RANDOM ACCESS TO GRAMMAR-COMPRESSED STRINGS AND TREES
    Bille, Philip
    Landau, Gad M.
    Raman, Rajeev
    Sadakane, Kunihiko
    Satti, Srinivasa Rao
    Weimann, Oren
    [J]. SIAM JOURNAL ON COMPUTING, 2015, 44 (03) : 513 - 539
  • [10] Bringmann K, 2019, Disc Algorithms, P1126