TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval

Cited by: 4
Authors
Althammer, Sophia [1 ]
Hofstaetter, Sebastian [1 ]
Verberne, Suzan [2 ]
Hanbury, Allan [1 ]
Affiliations
[1] Vienna Univ Technol, Vienna, Austria
[2] Leiden Univ, Leiden, Netherlands
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022 | 2022
Keywords
Test collections; health retrieval; relevance judgements
DOI
10.1145/3511808.3557714
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Robust test collections are crucial for Information Retrieval research. Recently there has been growing interest in evaluating retrieval systems for domain-specific retrieval tasks; however, these tasks often lack a reliable test collection with human-annotated relevance assessments following the Cranfield paradigm. In the medical domain, the TripClick collection was recently proposed, which contains click log data from the Trip search engine and includes two click-based test sets. However, the clicks are biased toward the retrieval model used, which remains unknown, and a previous study shows that the test sets have low judgement coverage for the top-10 results of lexical and neural retrieval models. In this paper we present TripJudge, a novel relevance judgement test collection for TripClick health retrieval. We collect relevance judgements in an annotation campaign and ensure the quality and reusability of TripJudge by using a variety of ranking methods for pool creation, by obtaining multiple judgements per query-document pair and by requiring at least moderate inter-annotator agreement. We compare system evaluation with TripJudge and TripClick and find that click-based and judgement-based evaluation can lead to substantially different system rankings.
Pages: 3801-3805
Number of pages: 5
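
As a minimal illustration (not taken from the paper itself), the two quality checks named in the abstract, inter-annotator agreement and the comparison of system rankings under click-based versus judgement-based evaluation, are commonly computed with Cohen's kappa and Kendall's tau. The Python sketch below uses invented labels and scores; all variable names and values are hypothetical.

    # Illustrative sketch of the two quality checks mentioned in the abstract.
    # All labels and scores below are invented for illustration only.
    from sklearn.metrics import cohen_kappa_score
    from scipy.stats import kendalltau

    # Hypothetical binary relevance labels from two annotators for the same
    # ten query-document pairs (0 = not relevant, 1 = relevant).
    annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

    # Cohen's kappa measures agreement beyond chance; values between 0.41
    # and 0.60 are conventionally read as "moderate" agreement.
    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")  # ~0.58 for these labels

    # Hypothetical effectiveness scores of five systems under a click-based
    # and a judgement-based test set. A low Kendall's tau between the induced
    # rankings means the two evaluations order the systems differently.
    scores_click = [0.31, 0.28, 0.35, 0.22, 0.30]
    scores_judge = [0.27, 0.33, 0.29, 0.25, 0.36]
    tau, p_value = kendalltau(scores_click, scores_judge)
    print(f"Kendall's tau between system rankings: {tau:.2f} (p = {p_value:.2f})")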