Index compression using 64-bit words

被引：69

作者：

Anh, Vo Ngoc ^{[1
]}

Moffat, Alistair ^{[1
]}

机构：

[1] Univ Melbourne, Dept Comp Sci & Software Engn, Melbourne, Vic 3010, Australia

来源：

SOFTWARE-PRACTICE & EXPERIENCE | 2010年 / 40卷 / 02期

基金：

澳大利亚研究理事会;

关键词：

performance; measurement; index compression; information retrieval; TEXT RETRIEVAL; INFORMATION-RETRIEVAL; INVERTED FILES; SYSTEMS;

D O I：

10.1002/spe.948

中图分类号：

TP31 [计算机软件];

学科分类号：

081202 ; 0835 ;

摘要：

Modern computers typically make use of 64-bit words as the fundamental unit of data access. However the decade-long migration from 32-bit architectures has not been reflected in compression technology, because of a widespread assumption that effective compression techniques operate in terms of bits or bytes, rather than words. Here we demonstrate that the use of 64-bit access units, especially in connection with word-bounded codes, does indeed provide the opportunity for improving the compression performance. In particular, we extend several 32-bit word-bounded coding schemes to 64-bit operation and explore their uses in information retrieval applications. Our results show that the Simple-8b approach, a 64-bit word-bounded code, is an excellent self-skipping code, and has a clear advantage over its competitors in supporting fast query evaluation when the data being compressed represents the inverted index for a large text collection. The advantages of the new code also accrue on 32-bit architectures, and for all of Boolean. ranked, and phrase queries; which means that it can be used in any situation. Copyright (C) 2010 John Wiley & Sons, Ltd.

引用

页码：131 / 147

页数：17

共 24 条

[1] Improved word-aligned binary compression for text indexing [J].

Anh, VN ;

Moffat, A .

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2006, 18 (06) :857-861

[2] Inverted index compression using word-aligned binary codes [J].

Anh, VN ;

Moffat, A .

INFORMATION RETRIEVAL, 2005, 8 (01) :151-166

[3]

[Anonymous], 2008, P 17 INT C WORLD WID, DOI [DOI 10.1145/1367497.1367550, 10.1145/1367497.1367550]

[4]

[Anonymous], 2008, Introduction to information retrieval

[5]

[Anonymous], 2006, ICDE 59, DOI DOI 10.1109/ICDE.2006.150

[6]

Barbay J, 2006, LECT NOTES COMPUT SC, V4007, P146

[7]

Brisaboa NR, 2003, LECT NOTES COMPUT SC, V2857, P122

[8]

BUTTCHER S, 2006, UNALIGNED BINARY COD, P10

[9]

Buttcher Stefan, 2007, P 16 ACM C INF KNOWL, P761, DOI [10.1145/1321440.1321546, DOI 10.1145/1321440.1321546]

[10] Unique-order interpolative coding for fast querying and space-efficient indexing in information retrieval systems [J].

Cheng, CS ;

Shann, JJJ ;

Chung, CP .

INFORMATION PROCESSING & MANAGEMENT, 2006, 42 (02) :407-428

← 1 2 3 →