Validating Synthetic Usage Data in Living Lab Environments

Cited by: 1
Authors
Breuer, Timo [1]
Fuhr, Norbert [2]
Schaer, Philipp [1]
Affiliations
[1] TH Köln (University of Applied Sciences), Institute of Information Management, Faculty of Information Science and Communication Studies, D-50678 Cologne, Germany
[2] University of Duisburg-Essen, Department of Computer Science and Applied Cognitive Science, Faculty of Engineering Sciences, D-47048 Duisburg, Germany
Source
ACM Journal of Data and Information Quality | 2024, Vol. 16, No. 1
Keywords
Synthetic usage data; click signals; system evaluation; living labs
DOI
10.1145/3623640
CLC number
TP [Automation technology; computer technology]
Subject classification code
0812 (Computer Science and Technology)
Abstract
Evaluating retrieval performance without editorial relevance judgments is challenging, but user interactions can be used as relevance signals instead. Living labs offer a way for small-scale platforms to validate information retrieval systems with real users. If enough user interaction data are available, click models can be parameterized from historical sessions to evaluate systems before exposing users to experimental rankings. However, interaction data are sparse in living labs, and little is known about how click models can be validated for reliable user simulations when click data are available only in moderate amounts. This work introduces an evaluation approach for validating synthetic usage data generated by click models in data-sparse human-in-the-loop environments like living labs. We ground our methodology in the click model's estimates for a system ranking compared to a reference ranking for which the relative performance is known. Our experiments compare different click models in terms of their reliability and robustness as more session log data become available. In our setup, simple click models can reliably determine the relative system performance with as few as 20 logged sessions for 50 queries. In contrast, more complex click models require more session data for reliable estimates, but they are a better choice in simulated interleaving experiments when enough session data are available. While it is easier for click models to distinguish between more diverse systems, it is harder to reproduce the ranking of systems based on the same retrieval algorithm with different interpolation weights. Our setup is entirely open, and we share the code to reproduce the experiments.
Pages: 33
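
The abstract outlines the core mechanism: parameterize a click model from historical session logs, then let it generate synthetic clicks on candidate rankings to estimate relative system performance before real users are exposed. Below is a minimal Python sketch of that idea using a simple position-based click model (PBM), where the click probability at a rank is the examination probability of the rank times the attractiveness of the document. The function names and the count-based parameter estimates are illustrative assumptions, not the paper's released code; a full implementation would fit the parameters with EM, as in click-model libraries such as PyClick.

# Minimal sketch: parameterize a position-based click model (PBM) from
# logged sessions, then compare two rankings via simulated clicks.
# The count-based estimation below is a deliberate simplification of the
# EM procedure a real click-model implementation would use.
import random
from collections import defaultdict

def estimate_pbm(sessions, num_ranks):
    # sessions: iterable of (ranking, clicked_ranks) pairs from the log,
    # where ranking is a list of document ids and clicked_ranks is a set
    doc_clicks, doc_shows = defaultdict(int), defaultdict(int)
    rank_clicks, rank_shows = [0] * num_ranks, [0] * num_ranks
    for ranking, clicked in sessions:
        for r, doc in enumerate(ranking[:num_ranks]):
            doc_shows[doc] += 1
            rank_shows[r] += 1
            if r in clicked:
                doc_clicks[doc] += 1
                rank_clicks[r] += 1
    # attractiveness per document and examination per rank (crude CTR-based
    # point estimates; real PBM training disentangles the two with EM)
    alpha = {d: doc_clicks[d] / doc_shows[d] for d in doc_shows}
    gamma = [rank_clicks[r] / max(rank_shows[r], 1) for r in range(num_ranks)]
    return alpha, gamma

def simulate_session(ranking, alpha, gamma, rng):
    # PBM assumption: P(click at rank r) = gamma[r] * alpha[doc]
    return [r for r, doc in enumerate(ranking[:len(gamma)])
            if rng.random() < gamma[r] * alpha.get(doc, 0.0)]

def expected_clicks(ranking, alpha, gamma, n_sessions=1000, seed=42):
    # average number of synthetic clicks per simulated session
    rng = random.Random(seed)
    return sum(len(simulate_session(ranking, alpha, gamma, rng))
               for _ in range(n_sessions)) / n_sessions

Given parameters estimated from the historical log, comparing expected_clicks(candidate, alpha, gamma) against expected_clicks(reference, alpha, gamma) mirrors the validation idea described in the abstract: checking whether simulated users reproduce the known relative ordering of the reference and candidate systems.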
相关论文
共 102 条
  • [1] Agichtein E., 2006, Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P19, DOI 10.1145/1148170.1148177
  • [2] TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval
    Althammer, Sophia
    Hofstaetter, Sebastian
    Verberne, Suzan
    Hanbury, Allan
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022, 2022, : 3801 - 3805
  • [3] Amati G, 2006, LECT NOTES COMPUT SC, V3936, P13
  • [4] [Anonymous], 2014, P 5 INFORM INTERACTI, DOI [DOI 10.1145/2637002.2637028, 10.1145/2637002.2637028]
  • [5] Azzopardi L, 2011, LECT NOTES COMPUT SC, V6941, P26, DOI 10.1007/978-3-642-23708-9_5
  • [6] Designing and Deploying Online Field Experiments
    Bakshy, Eytan
    Eckles, Dean
    Bernstein, Michael S.
    [J]. WWW'14: PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON WORLD WIDE WEB, 2014, : 283 - 292
  • [7] Balog Krisztian, 2021, ACM SIGIR Forum, P1, DOI 10.1145/3527546.3527559
  • [8] CIKM 2013 Workshop on Living Labs for Information Retrieval Evaluation
    Balog, Krisztian
    Elsweiler, David
    Kanoulas, Evangelos
    Kelly, Liadh
    Smucker, Mark D.
    [J]. PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013,
  • [9] Balog Krisztian, 2014, P 23 ACM INT C CONFE, P1815, DOI [10 . 1145 / 2661829.2661962, DOI 10.1145/2661829.2661962, 10.1145/2661829.2661962]
  • [10] Modeling Behavioral Factors in Interactive Information Retrieval
    Baskaya, Feza
    Keskustalo, Heikki
    Jarvelin, Kalervo
    [J]. PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, : 2297 - 2302