Indexing Word Sequences for Ranked Retrieval

被引:10
作者
Huston, Samuel [1 ]
Culpepper, J. Shane [2 ]
Croft, W. Bruce [1 ]
机构
[1] Univ Massachusetts, Ctr Intelligent Informat Retrieval, Amherst, MA 01003 USA
[2] RMIT Univ, Sch Comp Sci & Informat Technol, Melbourne, Vic, Australia
基金
澳大利亚研究理事会; 美国国家科学基金会;
关键词
Sketching; indexing; scalability; term-dependency models; Algorithms; Experimentation; Performance; COUNT-MIN SKETCH; FINDING FREQUENT; STRATEGIES; ALGORITHMS;
D O I
10.1145/2559168
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing word-sequence statistics using inverted indexes requires unreasonable processing time or substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance. In this article, we present and analyze a new index structure designed to improve query efficiency in dependency retrieval models. By adapting a class of (is an element of, delta)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate statistics important in term-dependency models with low, probabilistically bounded error rates. The space requirements for the vocabulary of the index is only logarithmically linked to the size of the vocabulary. Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of n-grams consisting of between 1 and 4 words extracted from the GOV2 collection to less than 0.01% of the space requirements of the vocabulary of a full index. We also show that larger n-gram queries can be processed considerably more efficiently than in current alternatives, such as positional and next-word indexes.
引用
收藏
页数:26
相关论文
共 62 条
[1]  
Abdel-Hamid O., 2009, Proceedings of the 18th international conference on World wide web, P61
[2]   The space complexity of approximating the frequency moments [J].
Alon, N ;
Matias, Y ;
Szegedy, M .
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1999, 58 (01) :137-147
[3]  
[Anonymous], 2008, P 31 ANN INT ACM SIG
[4]  
[Anonymous], 2012, P JOINT C EMPI
[5]  
[Anonymous], 1998, SIGIR 98 P 21 ANN IN, DOI DOI 10.1145/290941.291008
[6]  
[Anonymous], 2016, Information retrieval: Implementing and evaluating search engines
[7]  
Bendersky M, 2012, SIGIR 2012: PROCEEDINGS OF THE 35TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, P941, DOI 10.1145/2348283.2348408
[8]  
Bendersky Michael, 2010, P 3 ACM INT C WEB SE, P31
[9]  
Bergsma S., 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, P819
[10]   Space-optimal Heavy Hitters with Strong Error Bounds [J].
Berinde, Radu ;
Indyk, Piotr ;
Cormode, Graham ;
Strauss, Martin J. .
PODS'09: PROCEEDINGS OF THE TWENTY-EIGHTH ACM SIGMOD-SIGACT-SIGART SYMPOSIUM ON PRINCIPLES OF DATABASE SYSTEMS, 2009, :157-166