Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments

Cited by: 61
Author
Carterette, Benjamin A. [1]
Affiliation
[1] Univ Delaware, Dept Comp & Informat Syst, Newark, DE 19716 USA
Keywords
Experimentation; Measurement; Theory; Information retrieval; effectiveness evaluation; test collections; experimental design; statistical analysis; inference
DOI
10.1145/2094072.2094076
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
High-quality reusable test collections and formal statistical hypothesis testing together support a rigorous experimental environment for information retrieval research. But as Armstrong et al. [2009b] recently argued, global analysis of experiments suggests that there has actually been little real improvement in ad hoc retrieval effectiveness over time. We investigate this phenomenon in the context of simultaneous testing of many hypotheses using a fixed set of data. We argue that the most common approaches to significance testing ignore a great deal of information about the world. Taking into account even a fairly small amount of this information can lead to very different conclusions about systems than those that have appeared in published literature. We demonstrate how to model a set of IR experiments for analysis both mathematically and practically, and show that doing so can cause p-values from statistical hypothesis tests to increase by orders of magnitude. This has major consequences on the interpretation of experimental results using reusable test collections: it is very difficult to conclude that anything is significant once we have modeled many of the sources of randomness in experimental design and analysis.
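As a rough illustration of why testing many hypotheses on one fixed test collection inflates p-values, the sketch below applies two generic textbook corrections (Bonferroni and Holm) to a set of raw p-values. This is not the model-based procedure the paper develops; the function names and the example p-values are hypothetical and chosen only for illustration.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from four pairwise system comparisons.
raw = [0.001, 0.012, 0.030, 0.049]
bonferroni = bonferroni_adjust(raw)
holm = holm_adjust(raw)

alpha = 0.05
print("significant before adjustment:", sum(p < alpha for p in raw))
print("significant after Bonferroni: ", sum(p < alpha for p in bonferroni))
print("significant after Holm:       ", sum(p < alpha for p in holm))
```

All four hypothetical comparisons fall below 0.05 before adjustment, but only two remain below it after either correction. The effect grows with the number of hypotheses tested on the same data.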
Pages: 34
Related papers
38 entries in total
  • [1] [Anonymous], 1980, Multivariate Analysis
  • [2] [Anonymous], 2008, P 17 ACM C INFORM KN
  • [3] ARMSTRONG T. G., 2009, P 32 ANN INT ACM SIG, P25
  • [4] ARMSTRONG T. G., 2009, P 18 ACM C INF KNOWL
  • [5] Could Fisher, Jeffreys and Neyman have agreed on testing?
    Berger, JO
    [J]. STATISTICAL SCIENCE, 2003, 18 (01) : 1 - 12
  • [6] Box G. E. P., 1979, Robustness in Statistics, V1, P201, DOI 10.1016/B978-0-12-438150-6.50018-2
  • [7] Bretz F., 2010, MULTIPLE COMPARISONS
  • [8] Buckley C., 2000, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P33, DOI 10.1145/345508.345543
  • [9] CARTERETTE B., 2011, P 3 INT C THEOR INF
  • [10] Carterette B., 2007, P CIKM, P643, DOI 10.1145/1321440.1321530