Navigating Imprecision in Relevance Assessments on the Road to Total Recall: Roger and Me

Cited by: 12

Authors:

Cormack, Gordon V. [1]

Grossman, Maura R. [1]

Affiliation:

[1] Univ Waterloo, Waterloo, ON, Canada

Source:

SIGIR'17: PROCEEDINGS OF THE 40TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL | 2017

Keywords:

DOI:

10.1145/3077136.3080812

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Technology-assisted review ("TAR") systems seek to achieve "total recall"; that is, to approach, as nearly as possible, the ideal of 100% recall and 100% precision, while minimizing human review effort. The literature reports that TAR methods using relevance feedback can achieve considerably greater than the 65% recall and 65% precision reported by Voorhees as the "practical upper bound on retrieval performance... since that is the level at which humans agree with one another" (Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 2000). This work argues that in order to build, as well as to evaluate, TAR systems that approach 100% recall and 100% precision, it is necessary to model human assessment, not as absolute ground truth, but as an indirect indicator of the amorphous property known as "relevance." The choice of model impacts both the evaluation of system effectiveness and the simulation of relevance feedback. Models are presented that better fit available data than the infallible ground-truth model. These models suggest ways to improve TAR-system effectiveness so that hybrid human-computer systems can improve on both the accuracy and efficiency of human review alone. This hypothesis is tested by simulating TAR using two datasets: the TREC 4 AdHoc collection, and a dataset consisting of 401,960 email messages that were manually reviewed and classified by a single individual, Roger, in his official capacity as Senior State Records Archivist. The results using the TREC 4 data show that TAR achieves higher recall and higher precision than the assessments by either of two independent NIST assessors, and blind adjudication of the email dataset, conducted by Roger, more than two years after his original review, shows that he could have achieved the same recall and better precision, while reviewing substantially fewer than 401,960 emails, had he employed TAR in place of exhaustive manual review.

References

Pages: 5-14

Page count: 10

32 references in total

[1] A Review of Factors Influencing User Satisfaction in Information Retrieval [J]. Al-Maskari, Azzah; Sanderson, Mark. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2010, 61 (05): 859-868

[2] [Anonymous], CLEFEL

[3] Aslam J. A., 2006, SIGIR

[4] Bailey P., SIGIR 2008

[5] AN EVALUATION OF RETRIEVAL EFFECTIVENESS FOR A FULL-TEXT DOCUMENT-RETRIEVAL SYSTEM [J]. BLAIR, DC; MARON, ME. COMMUNICATIONS OF THE ACM, 1985, 28 (03): 289-299

[6] Cormack G., TREC 2009

[7] Cormack G. V., SIGIR 2015

[8] Cormack G. V., TREC 2015

[9] Cormack G. V., SIGIR 2016

[10] Cormack G. V., SIGIR 2014
