Performance standards and evaluations in IR test collections: Vector-space and other retrieval models

Cited by: 26
Authors
Shaw, WM [1]
Burgin, R [1]
Howell, P [1]
Affiliations
[1] N CAROLINA CENT UNIV, SCH LIB & INFORMAT SCI, DURHAM, NC 27707 USA
DOI
10.1016/S0306-4573(96)00044-1
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Low performance standards for each query and for the group of queries in 13 traditional and four TREC test collections have been computed. Predicted by the hypergeometric distribution, the standards represent the highest level of retrieval effectiveness attributable to chance. Operational levels of performance for vector-space, ad-hoc-feature-based, probabilistic, and other retrieval models have been compared to the standards. The effectiveness of these techniques in small, traditional test collections can be explained by retrieving a few more relevant documents for most queries than expected by chance, and the effectiveness of retrieval techniques in the large TREC test collections can only be explained by retrieving many more relevant documents for most queries than expected by chance. The discrepancy between deviations from chance in traditional and TREC test collections is due to a decrease in performance standards for large test collections, not to an increase in operational performance. Retrieving a few more relevant documents than expected by chance leads to mediocre levels of performance; recall and precision are rarely greater than 0.50 for any retrieval strategy in any test collection. However, marginal improvements to expectations based on chance may be sufficient to initiate successful interactions between an end-user and the next generation of retrieval systems, in which relevance judgments will be automatically translated into progressively improving estimates of the capacity of terms and other features to discriminate between relevant and non-relevant documents. Realization of such systems would be enhanced by abandoning uninformative performance summaries and focusing on effectiveness and improvements in effectiveness of individual queries. Copyright (C) 1997 Elsevier Science Ltd
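The chance-based standard mentioned in the abstract can be illustrated with a small sketch. This is not the authors' procedure, which this record does not spell out; it only assumes that random retrieval of n documents from a collection of N documents, R of them relevant, yields a hypergeometrically distributed number of relevant documents. The function names and the `alpha` cutoff are hypothetical choices made for illustration.

```python
from math import comb


def hypergeom_pmf(k, N, R, n):
    """P(exactly k relevant documents among n drawn at random,
    without replacement, from N documents of which R are relevant)."""
    if k < max(0, n - (N - R)) or k > min(n, R):
        return 0.0
    return comb(R, k) * comb(N - R, n - k) / comb(N, n)


def expected_by_chance(N, R, n):
    """Mean of the hypergeometric distribution: relevant documents
    expected in a random sample of n documents."""
    return n * R / N


def chance_threshold(N, R, n, alpha=0.05):
    """Smallest k with P(X >= k) <= alpha under random retrieval.

    `alpha` is an assumed cutoff, not a value taken from the paper.
    Returns min(n, R) + 1 when no achievable count meets the cutoff.
    """
    k_max = min(n, R)
    tail = 0.0
    threshold = k_max + 1
    # Accumulate the upper tail from the largest possible count downward.
    for k in range(k_max, -1, -1):
        tail += hypergeom_pmf(k, N, R, n)
        if tail > alpha:
            break
        threshold = k
    return threshold


if __name__ == "__main__":
    # Hypothetical query: 1000 documents, 10 relevant, 20 retrieved.
    print(expected_by_chance(1000, 10, 20))
    print(chance_threshold(1000, 10, 20))
```

Under these assumptions, a fixed number of relevant and retrieved documents gives a smaller chance expectation n·R/N as the collection size N grows, which is consistent with the abstract's remark that performance standards decrease for large test collections.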
Pages: 15-36
Page count: 22