Stronger Lempel-Ziv Based Compressed Text Indexing

被引:24
作者
Arroyuelo, Diego [1 ]
Navarro, Gonzalo [2 ]
Sadakane, Kunihiko [3 ]
机构
[1] Yahoo Res Latin Amer, Santiago, Chile
[2] Univ Chile, Dept Comp Sci, Santiago, Chile
[3] Natl Inst Informat, Principles Informat Res Div, Chiyoda Ku, Tokyo 1018430, Japan
关键词
Text compression; Compressed data structures; Compressed full-text indices; Lempel-Ziv compression; SUFFIX ARRAYS; TREES; SPACE; REPRESENTATIONS; SEQUENCES; ALGORITHM;
D O I
10.1007/s00453-010-9443-8
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
Given a text T[1..u] over an alphabet of size sigma, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The Lempel-Ziv index (LZ-index) of Navarro is a compressed full-text self-index able to represent T using 4uH (k) (T)+o(ulog sigma) bits of space, where H (k) (T) denotes the k-th order empirical entropy of T, for any k=o(log (sigma) u). This space is about four times the compressed text size. The index can locate all the occ occurrences of a pattern P in T in O(m (3)log sigma+(m+occ)log u) worst-case time. Although this index has proven very competitive in practice, the O(m (3)log sigma) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives. In this paper we present stronger Lempel-Ziv based indices (LZ-indices), improving the overall performance of the original LZ-index. We achieve indices requiring (2+epsilon)uH (k) (T)+o(ulog sigma) bits of space, for any constant epsilon > 0, which makes them the smallest existing LZ-indices. We simultaneously improve the search time to O(m (2)+(m+occ)log u), which makes our indices very competitive with state-of-the-art alternatives. Our indices support displaying any text substring of length a"" in optimal O(a""/log (sigma) u) time. In addition, we show how the space can be squeezed to (1+epsilon)uH (k) (T)+o(ulog sigma) to obtain a structure with O(m (2)) average search time for ma parts per thousand yen2log (sigma) u. Alternatively, the search time of LZ-indices can be improved to O((m+occ)log u) with (3+epsilon)uH (k) (T)+o(ulog sigma) bits of space, which is much less than the space needed by other Lempel-Ziv-based indices achieving the same search time. Overall our indices stand out as a very attractive alternative for space-efficient indexed text searching.
引用
收藏
页码:54 / 101
页数:48
相关论文
共 48 条
[1]  
[Anonymous], PIZZA CHILI CORPUS C
[2]  
APOSTOLICO A, 1985, NATO ISI SERIES, V1, P85
[3]  
Arroyuelo D, 2005, LECT NOTES COMPUT SC, V3827, P1143, DOI 10.1007/11602613_113
[4]  
ARROYUELO D, 2009, TRDCC20092 U CHIL
[5]  
ARROYUELO D, 2008, TRDCC20089 U CHIL
[6]  
ARROYUELO D, 2006, LNCS, V4009, P319
[7]  
Arroyuelo D, 2007, LECT NOTES COMPUT SC, V4580, P83
[8]  
Barbay J, 2007, PROCEEDINGS OF THE EIGHTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P680
[9]   Representing trees of higher degree [J].
Benoit, D ;
Demaine, ED ;
Munro, JI ;
Raman, R ;
Raman, V ;
Rao, SS .
ALGORITHMICA, 2005, 43 (04) :275-292
[10]  
Burrows M, 1994, BLOCK SORTING LOSSLE