TSP and cluster-based solutions to the reassignment of document identifiers

被引:13
作者
Blanco, Roi
Barreiro, Alvaro
机构
[1] University of Corunna,IRLab. Computer Science Department
来源
INFORMATION RETRIEVAL | 2006年 / 9卷 / 04期
关键词
document identifier reassignment; SVD; indexing; compression; TSP; clustering;
D O I
10.1007/s10791-006-6614-y
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Recent studies demonstrated that it is possible to reduce Inverted Files (IF) sizes by reassigning the document identifiers of the original collection, as this lowers the distance between the positions of documents related to a single term. Variable-bit encoding schemes can exploit the average gap reduction and decrease the total amount of bits per document pointer. This paper presents an efficient solution to the reassignment problem, which consists in reducing the input data dimensionality using a SVD transformation, as well as considering it a Travelling Salesman Problem (TSP). We also present some efficient solutions based on clustering. Finally, we combine both the TSP and the clustering strategies for reordering the document identifiers. We present experimental tests and performance results in two text TREC collections, obtaining good compression ratios with low running times, and advance the possibility of obtaining scalable solutions for web collections based on the techniques presented here.
引用
收藏
页码:499 / 517
页数:19
相关论文
共 18 条
[1]  
BARTELL BT, 1992, P 15 ANN INT ACM SIG, P161
[2]  
Blanco R, 2005, LECT NOTES COMPUT SC, V3408, P375
[3]   Index compression through document reordering [J].
Blandford, D ;
Blelloch, G .
DCC 2002: DATA COMPRESSION CONFERENCE, PROCEEDINGS, 2002, :342-351
[4]  
DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO
[5]  
2-9
[6]  
DUMAIS ST, 1994, NIST SPECIAL PUBLICA
[7]  
ELIAS P, 1975, IEEE T INFORM THEORY, V21, P194, DOI 10.1109/TIT.1975.1055349
[8]   OPTIMAL SOURCE CODES FOR GEOMETRICALLY DISTRIBUTED INTEGER ALPHABETS [J].
GALLAGER, RG ;
VANVOORHIS, DC .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1975, 21 (02) :228-230
[9]   RUN-LENGTH ENCODINGS [J].
GOLOMB, SW .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1966, 12 (03) :399-+
[10]   Data clustering: A review [J].
Jain, AK ;
Murty, MN ;
Flynn, PJ .
ACM COMPUTING SURVEYS, 1999, 31 (03) :264-323