Binary interpolative coding for effective index compression

被引:114
作者
Moffat, A [1 ]
Stuiver, L [1 ]
机构
[1] Univ Melbourne, Dept Comp Sci & Software Engn, Melbourne, Vic 3010, Australia
来源
INFORMATION RETRIEVAL | 2000年 / 3卷 / 01期
关键词
index compression; context-based model; document database;
D O I
10.1023/A:1013002601898
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
information retrieval systems contain large volumes of text, and currently have typical sizes into the gigabyte range. Inverted indexes are one important method for providing search facilities into these collections, but unless compressed require a great deal of space. In this paper we introduce a new method for compressing inverted indexes that yields excellent compression, fast decoding, and exploits clustering-the tendency for words to appear relatively frequently in some parts of thc collection and infrequently in others. We also describe two other quite separate applications for the same compression method: representing the MTF list positions generated by the Burrows-Wheeler Block Sorting transformation; and transmitting the codebook for se mi-static block-based minimum-redundancy coding.
引用
收藏
页码:25 / 47
页数:23
相关论文
共 33 条
  • [1] ANH VN, 1998, P 21 ANN INT ACM SIG, P290
  • [2] BELL TC, 1993, J AM SOC INFORM SCI, V44, P508, DOI 10.1002/(SICI)1097-4571(199310)44:9<508::AID-ASI2>3.0.CO
  • [3] 2-A
  • [4] COMPRESSION OF CORRELATED BIT-VECTORS
    BOOKSTEIN, A
    KLEIN, ST
    [J]. INFORMATION SYSTEMS, 1991, 16 (04) : 387 - 400
  • [5] Bookstein A., 1994, Proceedings DCC '94. Data Compression Conference (Cat. No.94TH0626-2), P116, DOI 10.1109/DCC.1994.305919
  • [6] Modeling word occurrences for the compression of concordances
    Bookstein, A
    Klein, ST
    Raita, T
    [J]. ACM TRANSACTIONS ON INFORMATION SYSTEMS, 1997, 15 (03) : 254 - 290
  • [7] A SYSTEMATIC-APPROACH TO COMPRESSING A FULL-TEXT RETRIEVAL-SYSTEM
    BOOKSTEIN, A
    KLEIN, ST
    ZIFF, DA
    [J]. INFORMATION PROCESSING & MANAGEMENT, 1992, 28 (06) : 795 - 806
  • [8] BURROWS M, 1994, 124 DIG EQ CORP PAL
  • [9] CHOUEKA Y, 1986, P 9 ACM SIGIR C, P88
  • [10] ELIAS P, 1975, IEEE T INFORM THEORY, V21, P194, DOI 10.1109/TIT.1975.1055349