Which Diversity Evaluation Measures Are "Good"?

Cited by: 29

Authors:

Sakai, Tetsuya [1]

Zeng, Zhaohao [1]

Affiliation:

[1] Waseda Univ, Tokyo, Japan

Source:

PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19) | 2019

Keywords:

evaluation measures; search result diversification; user preferences; METRICS

DOI:

10.1145/3331184.3331215

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

This study evaluates 30 IR evaluation measures or their instances, of which nine are for adhoc IR and 21 are for diversified IR, primarily from the viewpoint of whether their preferences of one SERP (search engine result page) over another actually align with users' preferences. The gold preferences were constructed by hiring 15 assessors, who independently examined 1,127 SERP pairs and made preference assessments. Two sets of preference assessments were obtained: one based on a relevance question "Which SERP is more relevant?" and the other based on a diversity question "Which SERP is likely to satisfy a higher number of users?" To our knowledge, our study is the first to have collected diversity preference assessments in this way and evaluated diversity measures successfully. Our main results are that (a) popular adhoc IR measures such as nDCG actually align quite well with the gold relevance preferences; and that (b) while the D#-measures align well with the gold diversity preferences, intent-aware measures perform relatively poorly. Moreover, as by-products of our analysis of existing evaluation measures, we define new adhoc measures called iRBU (intentwise Rank-Biased Utility) and EBR (Expected Blended Ratio); we demonstrate that an instance of iRBU performs as well as nDCG when compared to the gold relevance preferences. On the other hand, the original RBU, a recently proposed diversity measure, underperforms the best D#-measures when compared to the gold diversity preferences.


Pages: 595-604

Number of pages: 10
