Validating Synthetic Usage Data in Living Lab Environments

Cited by: 1
Authors
Breuer, Timo [1]
Fuhr, Norbert [2]
Schaer, Philipp [1]
Affiliations
[1] TH Köln (University of Applied Sciences), Institute of Information Management, Faculty of Information Science and Communication Studies, D-50678 Cologne, Germany
[2] University of Duisburg-Essen, Department of Computer Science and Applied Cognitive Science, Faculty of Engineering Sciences, D-47048 Duisburg, Germany
Source
ACM Journal of Data and Information Quality | 2024, Vol. 16, No. 1
Keywords
Synthetic usage data; click signals; system evaluation; living labs
DOI
10.1145/3623640
CLC number
TP [Automation technology; computer technology]
Subject classification code
0812 (Computer Science and Technology)
Abstract
Evaluating retrieval performance without editorial relevance judgments is challenging, but user interactions can be used as relevance signals instead. Living labs offer a way for small-scale platforms to validate information retrieval systems with real users. If enough user interaction data are available, click models can be parameterized from historical sessions to evaluate systems before exposing users to experimental rankings. However, interaction data are sparse in living labs, and little is known about how click models can be validated for reliable user simulations when click data are available only in moderate amounts. This work introduces an evaluation approach for validating synthetic usage data generated by click models in data-sparse human-in-the-loop environments like living labs. We ground our methodology in the click model's estimates for a system ranking compared to a reference ranking for which the relative performance is known. Our experiments compare different click models in terms of their reliability and robustness as more session log data become available. In our setup, simple click models can reliably determine the relative system performance with as few as 20 logged sessions for 50 queries. In contrast, more complex click models require more session data for reliable estimates, but they are a better choice in simulated interleaving experiments when enough session data are available. While it is easier for click models to distinguish between more diverse systems, it is harder to reproduce the ranking of systems based on the same retrieval algorithm with different interpolation weights. Our setup is entirely open, and we share the code to reproduce the experiments.
Pages: 33
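
The abstract outlines the core mechanism: parameterize a click model from historical session logs, then let it generate synthetic clicks on candidate rankings to estimate relative system performance before real users are exposed. Below is a minimal Python sketch of that idea using a simple position-based click model (PBM), where the click probability at a rank is the examination probability of the rank times the attractiveness of the document. The function names and the count-based parameter estimates are illustrative assumptions, not the paper's released code; a full implementation would fit the parameters with EM, as in click-model libraries such as PyClick.

# Minimal sketch: parameterize a position-based click model (PBM) from
# logged sessions, then compare two rankings via simulated clicks.
# The count-based estimation below is a deliberate simplification of the
# EM procedure a real click-model implementation would use.
import random
from collections import defaultdict

def estimate_pbm(sessions, num_ranks):
    # sessions: iterable of (ranking, clicked_ranks) pairs from the log,
    # where ranking is a list of document ids and clicked_ranks is a set
    doc_clicks, doc_shows = defaultdict(int), defaultdict(int)
    rank_clicks, rank_shows = [0] * num_ranks, [0] * num_ranks
    for ranking, clicked in sessions:
        for r, doc in enumerate(ranking[:num_ranks]):
            doc_shows[doc] += 1
            rank_shows[r] += 1
            if r in clicked:
                doc_clicks[doc] += 1
                rank_clicks[r] += 1
    # attractiveness per document and examination per rank (crude CTR-based
    # point estimates; real PBM training disentangles the two with EM)
    alpha = {d: doc_clicks[d] / doc_shows[d] for d in doc_shows}
    gamma = [rank_clicks[r] / max(rank_shows[r], 1) for r in range(num_ranks)]
    return alpha, gamma

def simulate_session(ranking, alpha, gamma, rng):
    # PBM assumption: P(click at rank r) = gamma[r] * alpha[doc]
    return [r for r, doc in enumerate(ranking[:len(gamma)])
            if rng.random() < gamma[r] * alpha.get(doc, 0.0)]

def expected_clicks(ranking, alpha, gamma, n_sessions=1000, seed=42):
    # average number of synthetic clicks per simulated session
    rng = random.Random(seed)
    return sum(len(simulate_session(ranking, alpha, gamma, rng))
               for _ in range(n_sessions)) / n_sessions

Given parameters estimated from the historical log, comparing expected_clicks(candidate, alpha, gamma) against expected_clicks(reference, alpha, gamma) mirrors the validation idea described in the abstract: checking whether simulated users reproduce the known relative ordering of the reference and candidate systems.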
相关论文
共 102 条
  • [1] Agichtein E., 2006, Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P19, DOI 10.1145/1148170.1148177
  • [2] TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval
    Althammer, Sophia
    Hofstaetter, Sebastian
    Verberne, Suzan
    Hanbury, Allan
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022, 2022, : 3801 - 3805
  • [3] Amati G, 2006, LECT NOTES COMPUT SC, V3936, P13
  • [4] [Anonymous], 2014, P 5 INFORM INTERACTI, DOI [DOI 10.1145/2637002.2637028, 10.1145/2637002.2637028]
  • [5] Azzopardi L, 2011, LECT NOTES COMPUT SC, V6941, P26, DOI 10.1007/978-3-642-23708-9_5
  • [6] Designing and Deploying Online Field Experiments
    Bakshy, Eytan
    Eckles, Dean
    Bernstein, Michael S.
    [J]. WWW'14: PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON WORLD WIDE WEB, 2014, : 283 - 292
  • [7] Balog Krisztian, 2021, ACM SIGIR Forum, P1, DOI 10.1145/3527546.3527559
  • [8] CIKM 2013 Workshop on Living Labs for Information Retrieval Evaluation
    Balog, Krisztian
    Elsweiler, David
    Kanoulas, Evangelos
    Kelly, Liadh
    Smucker, Mark D.
    [J]. PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013,
  • [9] Balog Krisztian, 2014, P 23 ACM INT C CONFE, P1815, DOI [10 . 1145 / 2661829.2661962, DOI 10.1145/2661829.2661962, 10.1145/2661829.2661962]
  • [10] Modeling Behavioral Factors in Interactive Information Retrieval
    Baskaya, Feza
    Keskustalo, Heikki
    Jarvelin, Kalervo
    [J]. PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, : 2297 - 2302