Repeatable evaluation of search services in dynamic environments

Cited by: 4
Authors
Jensen, Eric C.
Beitzel, Steven M. [1]
Chowdhury, Abdur
Frieder, Ophir [1, 2]
Affiliations
[1] IIT, Chicago, IL 60616 USA
[2] Georgetown Univ, Washington, DC 20057 USA
Keywords
algorithms; experimentation; evaluation; web search
DOI
10.1145/1292591.1292592
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In dynamic environments, such as the World Wide Web, a changing document collection, query population, and set of search services demand frequent repetition of search effectiveness (relevance) evaluations. Reconstructing static test collections, such as those in TREC, requires considerable human effort, as large collection sizes demand judgments deep into retrieved pools. In practice it is common to perform shallow evaluations over small numbers of live engines (often pairwise, engine A vs. engine B) without system pooling. Although these evaluations are not intended to construct reusable test collections, their utility depends on conclusions generalizing to the query population as a whole. We leverage the bootstrap estimate of the reproducibility probability of hypothesis tests in determining the query sample sizes required to ensure this, finding they are much larger than those required for static collections. We propose a semiautomatic evaluation framework to reduce this effort. We validate this framework against a manual evaluation of the top ten results of ten Web search engines across 896 queries in navigational and informational tasks. Augmenting manual judgments with pseudo-relevance judgments mined from Web taxonomies reduces both the chances of missing a correct pairwise conclusion and those of finding an errant conclusion by approximately 50%.
Pages: 38
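The abstract's idea of using a bootstrap estimate of the reproducibility probability of a hypothesis test to choose a query sample size can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which hypothesis test is used, so an exact two-sided sign test is assumed here, and the function names (`sign_test_p`, `reproducibility`, `min_sample_size`), the 0.05 significance level, the 0.8 target, and the input data are all hypothetical. The input `diffs` is assumed to hold per-query score differences between engine A and engine B.

```python
import math
import random


def sign_test_p(diffs):
    """Two-sided exact sign test p-value for paired differences (ties dropped)."""
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0  # no informative pairs, so no evidence of a difference
    k = min(pos, neg)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def reproducibility(diffs, n, alpha=0.05, trials=2000, seed=0):
    """Fraction of bootstrap resamples of size n whose test is significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = rng.choices(diffs, k=n)  # resample queries with replacement
        if sign_test_p(sample) < alpha:
            hits += 1
    return hits / trials


def min_sample_size(diffs, sizes, target=0.8, **kwargs):
    """Smallest candidate size whose estimated reproducibility meets target."""
    for n in sizes:
        if reproducibility(diffs, n, **kwargs) >= target:
            return n
    return None  # no candidate size was large enough


if __name__ == "__main__":
    # Synthetic per-query differences (assumed data): A beats B on 70% of queries.
    pool = [1.0] * 70 + [-1.0] * 30
    for size in (10, 50, 200):
        print(size, reproducibility(pool, size))
```

Under these assumptions, a pairwise conclusion that is significant in only a small fraction of resamples at a given size would be unlikely to reproduce on a fresh query sample of that size, which is the sense in which larger query samples are needed.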