Compressed Indexes for Dynamic Text Collections

被引:59
作者
Chan, Ho-Leung [1 ]
Hon, Wing-Kai [2 ]
Lam, Tak-Wah [1 ]
Sadakane, Kunihiko [3 ]
机构
[1] Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
[2] Natl Tsing Hua Univ, Dept Comp Sci, Hsinchu, Taiwan
[3] Kyushu Univ, Dept Comp Sci & Commun Engn, Fukuoka, Japan
关键词
Compressed suffix tree; string matching;
D O I
10.1145/1240233.1240244
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Let T be a string with n characters over an alphabet of constant size. A recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching [Ferragina and Manzini 2000; Grossi and Vitter 2000]. Yet the compressed nature of such indexes also makes them difficult to update dynamically. This article extends the work on optimal-space indexing to a dynamic collection of texts. Our first result is a compressed solution to the library management problem, where we show an index of O (n) bits for a text collection L of total length n, which can be updated in O(vertical bar T vertical bar log n) time when a text T is inserted or deleted from L; also, the index supports searching the occurrences of any pattern P in all texts in L in O (vertical bar P vertical bar log n + occ log(2) n) time, where occ is the number of occurrences. Our second result is a compressed solution to the dictionary matching problem, where we show an index of O(d) bits for a pattern collection V of total length d, which can be updated in O(vertical bar P vertical bar log(2) d) time when a pattern P is inserted or deleted from V; also, the index supports searching the occurrences of all patterns of V in any text T in O ((vertical bar T vertical bar + occ) log(2) d) time. When compared with the O (d log d)-bit suffix-tree-based solution of Amir et al. [1995], the compact solution increases the query time by roughly a factor of log d only. The solution to the dictionary matching problem is based on a new compressed representation of a suffix tree. Precisely, we give an O (n)-bit representation of a suffix tree for a dynamic collection of texts whose total length is n, which supports insertion and deletion of a text T in O(vertical bar T vertical bar log(2) n) time, as well as all suffix tree traversal operations, including forward and backward suffix links. This work can be regarded as a generalization of the compressed representation of static texts. In the study of the aforementioned result, we also derive the first O (n)-bit representation for maintaining n pairs of balanced parentheses in O (log n/log log n) time per operation, matching the time complexity of the previous O (n log n)-bit solution.
引用
收藏
页数:29
相关论文
共 33 条
[1]   EFFICIENT STRING MATCHING - AID TO BIBLIOGRAPHIC SEARCH [J].
AHO, AV ;
CORASICK, MJ .
COMMUNICATIONS OF THE ACM, 1975, 18 (06) :333-340
[2]  
AMIR A, 1991, PROCEEDINGS - 32ND ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, P760, DOI 10.1109/SFCS.1991.185445
[3]   DYNAMIC DICTIONARY MATCHING [J].
AMIR, A ;
FARACH, M ;
GALIL, Z ;
GIANCARLO, R ;
PARK, K .
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1994, 49 (02) :208-222
[4]   IMPROVED DYNAMIC DICTIONARY MATCHING [J].
AMIR, A ;
FARACH, M ;
IDURY, RM ;
LAPOUTRE, JA ;
SCHAFFER, AA .
INFORMATION AND COMPUTATION, 1995, 119 (02) :258-282
[5]  
AMIR A, 1992, LECT NOTES COMPUT SC, V644, P262
[6]  
[Anonymous], 1994, BLOCK SORTING LOSSLE
[7]  
ELIAS P, 1975, IEEE T INFORM THEORY, V21, P194, DOI 10.1109/TIT.1975.1055349
[8]   Indexing compressed text [J].
Ferragina, P ;
Manzini, G .
JOURNAL OF THE ACM, 2005, 52 (04) :552-581
[9]  
Ferragina P, 2004, LECT NOTES COMPUT SC, V3246, P150
[10]   Opportunistic data structures with applications [J].
Ferragina, P ;
Manzini, G .
41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 2000, :390-398