Test collection based evaluation of information retrieval systems

Cited by: 203
Authors
Sanderson M. [1]
Affiliations
[1] Information School, University of Sheffield, Sheffield
Source
Foundations and Trends in Information Retrieval | 2010 / Vol. 4 / No. 4
DOI
10.1561/1500000009
Abstract
Use of test collections and evaluation measures to assess the effectiveness of information retrieval systems has its origins in work dating back to the early 1950s. In the nearly 60 years since that work started, the use of test collections has become a de facto standard of evaluation. This monograph surveys the research conducted and explains the methods and measures devised for the evaluation of retrieval systems, including a detailed look at the use of statistical significance testing in retrieval experimentation. It also reviews more recent examinations of the validity of the test collection approach and of evaluation measures, and outlines trends in current research that exploits query logs and live labs. At its core, the modern-day test collection is little different from the structures that the pioneering researchers of the 1950s and 1960s conceived. This tutorial and review shows that, despite its age, this long-standing evaluation method remains a highly valued tool for retrieval research. © 2010 M. Sanderson.
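The test-collection methodology the abstract describes scores a system's ranked output against human relevance judgments (qrels). As a minimal illustrative sketch, not code from the monograph, the following computes Average Precision, one of the standard effectiveness measures in this evaluation tradition; the function and variable names are assumptions for illustration:

```python
def average_precision(ranking, relevant):
    """AP for one query: mean of precision at each rank holding a relevant doc.

    ranking  -- list of document ids in ranked order (system output)
    relevant -- set of relevant document ids (the qrels for this query)
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Divide by total relevant docs so unretrieved relevant docs count as misses.
    return precision_sum / len(relevant) if relevant else 0.0

# Relevant docs d1 and d3 retrieved at ranks 1 and 3:
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))  # (1 + 2/3) / 2
```

Averaging AP over all topics in a collection gives Mean Average Precision (MAP), the per-run score that significance tests in retrieval experiments typically compare between systems.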
Pages: 247-375
Page count: 128