Index compression using 64-bit words

被引:69
作者
Anh, Vo Ngoc [1 ]
Moffat, Alistair [1 ]
机构
[1] Univ Melbourne, Dept Comp Sci & Software Engn, Melbourne, Vic 3010, Australia
基金
澳大利亚研究理事会;
关键词
performance; measurement; index compression; information retrieval; TEXT RETRIEVAL; INFORMATION-RETRIEVAL; INVERTED FILES; SYSTEMS;
D O I
10.1002/spe.948
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
Modern computers typically make use of 64-bit words as the fundamental unit of data access. However the decade-long migration from 32-bit architectures has not been reflected in compression technology, because of a widespread assumption that effective compression techniques operate in terms of bits or bytes, rather than words. Here we demonstrate that the use of 64-bit access units, especially in connection with word-bounded codes, does indeed provide the opportunity for improving the compression performance. In particular, we extend several 32-bit word-bounded coding schemes to 64-bit operation and explore their uses in information retrieval applications. Our results show that the Simple-8b approach, a 64-bit word-bounded code, is an excellent self-skipping code, and has a clear advantage over its competitors in supporting fast query evaluation when the data being compressed represents the inverted index for a large text collection. The advantages of the new code also accrue on 32-bit architectures, and for all of Boolean. ranked, and phrase queries; which means that it can be used in any situation. Copyright (C) 2010 John Wiley & Sons, Ltd.
引用
收藏
页码:131 / 147
页数:17
相关论文
共 24 条
[1]   Improved word-aligned binary compression for text indexing [J].
Anh, VN ;
Moffat, A .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2006, 18 (06) :857-861
[2]   Inverted index compression using word-aligned binary codes [J].
Anh, VN ;
Moffat, A .
INFORMATION RETRIEVAL, 2005, 8 (01) :151-166
[3]  
[Anonymous], 2008, P 17 INT C WORLD WID, DOI [DOI 10.1145/1367497.1367550, 10.1145/1367497.1367550]
[4]  
[Anonymous], 2008, Introduction to information retrieval
[5]  
[Anonymous], 2006, ICDE 59, DOI DOI 10.1109/ICDE.2006.150
[6]  
Barbay J, 2006, LECT NOTES COMPUT SC, V4007, P146
[7]  
Brisaboa NR, 2003, LECT NOTES COMPUT SC, V2857, P122
[8]  
BUTTCHER S, 2006, UNALIGNED BINARY COD, P10
[9]  
Buttcher Stefan, 2007, P 16 ACM C INF KNOWL, P761, DOI [10.1145/1321440.1321546, DOI 10.1145/1321440.1321546]
[10]   Unique-order interpolative coding for fast querying and space-efficient indexing in information retrieval systems [J].
Cheng, CS ;
Shann, JJJ ;
Chung, CP .
INFORMATION PROCESSING & MANAGEMENT, 2006, 42 (02) :407-428