Stronger Lempel-Ziv Based Compressed Text Indexing

被引：24

作者：

Arroyuelo, Diego ^{[1
]}

Navarro, Gonzalo ^{[2
]}

Sadakane, Kunihiko ^{[3
]}

机构：

[1] Yahoo Res Latin Amer, Santiago, Chile

[2] Univ Chile, Dept Comp Sci, Santiago, Chile

[3] Natl Inst Informat, Principles Informat Res Div, Chiyoda Ku, Tokyo 1018430, Japan

来源：

ALGORITHMICA | 2012年 / 62卷 / 1-2期

关键词：

Text compression; Compressed data structures; Compressed full-text indices; Lempel-Ziv compression; SUFFIX ARRAYS; TREES; SPACE; REPRESENTATIONS; SEQUENCES; ALGORITHM;

D O I：

10.1007/s00453-010-9443-8

中图分类号：

TP31 [计算机软件];

学科分类号：

081202 ; 0835 ;

摘要：

Given a text T[1..u] over an alphabet of size sigma, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The Lempel-Ziv index (LZ-index) of Navarro is a compressed full-text self-index able to represent T using 4uH (k) (T)+o(ulog sigma) bits of space, where H (k) (T) denotes the k-th order empirical entropy of T, for any k=o(log (sigma) u). This space is about four times the compressed text size. The index can locate all the occ occurrences of a pattern P in T in O(m (3)log sigma+(m+occ)log u) worst-case time. Although this index has proven very competitive in practice, the O(m (3)log sigma) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives. In this paper we present stronger Lempel-Ziv based indices (LZ-indices), improving the overall performance of the original LZ-index. We achieve indices requiring (2+epsilon)uH (k) (T)+o(ulog sigma) bits of space, for any constant epsilon > 0, which makes them the smallest existing LZ-indices. We simultaneously improve the search time to O(m (2)+(m+occ)log u), which makes our indices very competitive with state-of-the-art alternatives. Our indices support displaying any text substring of length a"" in optimal O(a""/log (sigma) u) time. In addition, we show how the space can be squeezed to (1+epsilon)uH (k) (T)+o(ulog sigma) to obtain a structure with O(m (2)) average search time for ma parts per thousand yen2log (sigma) u. Alternatively, the search time of LZ-indices can be improved to O((m+occ)log u) with (3+epsilon)uH (k) (T)+o(ulog sigma) bits of space, which is much less than the space needed by other Lempel-Ziv-based indices achieving the same search time. Overall our indices stand out as a very attractive alternative for space-efficient indexed text searching.

引用

页码：54 / 101

页数：48

共 48 条

[1]

[Anonymous], PIZZA CHILI CORPUS C

[2]

APOSTOLICO A, 1985, NATO ISI SERIES, V1, P85

[3]

Arroyuelo D, 2005, LECT NOTES COMPUT SC, V3827, P1143, DOI 10.1007/11602613_113

[4]

ARROYUELO D, 2009, TRDCC20092 U CHIL

[5]

ARROYUELO D, 2008, TRDCC20089 U CHIL

[6]

ARROYUELO D, 2006, LNCS, V4009, P319

[7]

Arroyuelo D, 2007, LECT NOTES COMPUT SC, V4580, P83

[8]

Barbay J, 2007, PROCEEDINGS OF THE EIGHTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P680

[9] Representing trees of higher degree [J].

Benoit, D ;

Demaine, ED ;

Munro, JI ;

Raman, R ;

Raman, V ;

Rao, SS .

ALGORITHMICA, 2005, 43 (04) :275-292

[10]

Burrows M, 1994, BLOCK SORTING LOSSLE

← 1 2 3 4 5 →