Universal text preprocessing for data compression

Cited by: 21
Authors
Abel, J
Teahan, W
Affiliations
[1] Univ Duisburg Essen, Dept Commun Syst, Fac Engn Sci, D-47057 Duisburg, Germany
[2] Univ Wales, Sch Informat, Bangor LL57 1UT, Gwynedd, Wales
Keywords
algorithms; data compression; BWT; LZ; PPM; preprocessing; text compression
DOI
10.1109/TC.2005.85
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Several text-file preprocessing algorithms are presented that complement each other and are applied before the compression scheme. The algorithms need no external dictionary and are language independent. The compression gain is compared, along with the cost in speed, for the BWT, PPM, and LZ compression schemes. The average overall compression gain is 3 to 5 percent for the text files of the Calgary Corpus and 2 to 9 percent for the text files of the large Canterbury Corpus.
Pages: 497-507
Page count: 11