An exploration of proximity measures in information retrieval

Cited by: 76
Authors
Tao, Tao [1]
Zhai, Chengxiang [2]
Affiliations
[1] Microsoft Corporation, Redmond
[2] Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana
Source
Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07 | 2007
Keywords
Distance measures; Proximity; Retrieval heuristics
DOI
10.1145/1277741.1277794
Abstract
In most existing retrieval models, documents are scored primarily based on various kinds of term statistics such as within-document frequencies, inverse document frequencies, and document lengths. Intuitively, the proximity of matched query terms in a document can also be exploited to promote scores of documents in which the matched query terms are close to each other. Such a proximity heuristic, however, has been largely under-explored in the literature; it is unclear how we can model proximity and incorporate a proximity measure into an existing retrieval model. In this paper, we systematically explore the query term proximity heuristic. Specifically, we propose and study the effectiveness of five different proximity measures, each modeling proximity from a different perspective. We then design two heuristic constraints and use them to guide us in incorporating the proposed proximity measures into an existing retrieval model. Experiments on five standard TREC test collections show that one of the proposed proximity measures is indeed highly correlated with document relevance, and by incorporating it into the KL-divergence language model and the Okapi BM25 model, we can significantly improve retrieval performance. Copyright 2007 ACM.
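
The proximity heuristic described in the abstract can be illustrated with a small sketch. The abstract does not define the five proposed measures, so the function below is an assumption made for illustration only: it takes the smallest positional gap between occurrences of two distinct query terms as a simple distance-based proximity signal. The names `term_positions` and `min_pair_distance` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def term_positions(tokens, query_terms):
    """Map each query term that occurs in the document to its token positions."""
    wanted = set(query_terms)
    positions = {}
    for i, tok in enumerate(tokens):
        if tok in wanted:
            positions.setdefault(tok, []).append(i)
    return positions


def min_pair_distance(tokens, query_terms):
    """Smallest positional gap between occurrences of two distinct query terms.

    Illustrative assumption, not the paper's definition. Returns None when
    fewer than two distinct query terms are matched in the document.
    """
    positions = term_positions(tokens, query_terms)
    if len(positions) < 2:
        return None
    return min(
        abs(p - q)
        for a, b in combinations(positions, 2)
        for p in positions[a]
        for q in positions[b]
    )


# Two toy documents that match the same query terms the same number of times.
close = "proximity measures help information retrieval systems".split()
far = "information about libraries and the history of document retrieval".split()
query = ["information", "retrieval"]

print(min_pair_distance(close, query))
print(min_pair_distance(far, query))
```

Both toy documents match each query term once, so term-frequency statistics alone would not separate them; the smaller distance for the first document is the kind of signal a proximity-aware scoring function could reward.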
Pages: 295-302
Page count: 8