New adaptive compressors for natural language text

被引:16
作者
Brisaboa, N. R. [1 ]
Farina, A. [1 ]
Navarro, G. [2 ]
Parama, J. R. [1 ]
机构
[1] Univ A Coruna, Dept Comp Sci, Database Lab, La Coruna 15071, Spain
[2] Univ Chile, Dept Comp Sci, Ctr Web Res, Santiago, Chile
关键词
text databases; natural language text compression; dynamic compression; searching compressed text;
D O I
10.1002/spe.882
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
Semistatic byte-oriented word-based compression codes have been shown to be an attractive alternative to compress natural language text databases, because of the combination of speed, effectiveness, and direct searchability they offer. In particular, our recently proposed family of dense compression codes has been shown to be superior to the more traditional byte-oriented word-based Huffman codes in most aspects. In this paper, we focus on the problem of transmitting texts among peers that do not share the vocabulary. This is the typical scenario for adaptive compression methods. We design adaptive variants of our semistatic dense codes, showing that they are much simpler and faster than dynamic Huffman codes and reach almost the same compression effectiveness. We show that our variants have a very compelling trade-off between compression/decompression speed, compression ratio, and search speed compared with most of the state-of-the-art general compressors. Copyright (C) 2008 John Wiley & Sons, Ltd.
引用
收藏
页码:1429 / 1450
页数:22
相关论文
共 25 条
[1]  
Allauzen C, 1999, LECT NOTES COMPUT SC, V1725, P295
[2]  
[Anonymous], 2002, FLEXIBLE PATTERN MAT
[3]  
[Anonymous], 1994, BLOCK SORTING LOSSLE
[4]   A LOCALLY ADAPTIVE DATA-COMPRESSION SCHEME [J].
BENTLEY, JL ;
SLEATOR, DD ;
TARJAN, RE ;
WEI, VK .
COMMUNICATIONS OF THE ACM, 1986, 29 (04) :320-330
[5]   FAST STRING SEARCHING ALGORITHM [J].
BOYER, RS ;
MOORE, JS .
COMMUNICATIONS OF THE ACM, 1977, 20 (10) :762-772
[6]  
BRISABOA N, 2005, P 28 ANN INT ACM SIG, P234
[7]   Lightweight natural language text compression [J].
Brisaboa, Nieves R. ;
Farina, Antonio ;
Navarro, Gonzalo ;
Parama, Jose R. .
INFORMATION RETRIEVAL, 2007, 10 (01) :1-33
[8]  
Brisaboa NR, 2004, LECT NOTES COMPUT SC, V3246, P230
[9]  
CARPINELLI J, 2004, WORD CHARACTER INTEG
[10]  
Culpepper JS, 2005, LECT NOTES COMPUT SC, V3772, P1