Task-based evaluation of text summarization using relevance prediction

Cited by: 17
Authors
Hobson, Stacy President [1]
Dorr, Bonnie J.
Monz, Christof
Schwartz, Richard
Affiliations
[1] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[2] Univ Maryland, UMIACS, College Pk, MD 20742 USA
[3] Queen Mary Univ London, Dept Comp Sci, London E1 4NS, England
[4] BBN Technol, Columbia, MD 21046 USA
Keywords
summarization evaluation; summary usefulness; relevance prediction
DOI
10.1016/j.ipm.2007.01.002
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This article introduces a new task-based evaluation measure called Relevance Prediction that is a more intuitive measure of an individual's performance on a real-world task than interannotator agreement. Relevance Prediction parallels what a user does in the real-world task of browsing a set of documents using standard search tools, i.e., the user judges relevance based on a short summary, and then that same user (not an independent user) decides whether to open (and judge) the corresponding document. This measure is shown to be a more reliable measure of task performance than LDC Agreement, a current gold-standard-based measure used in the summarization evaluation community. Our goal is to provide a stable framework within which developers of new automatic measures may make stronger statistical statements about the effectiveness of their measures in predicting summary usefulness. We demonstrate, as a proof-of-concept methodology for automatic metric developers, that a current automatic evaluation measure has a better correlation with Relevance Prediction than with LDC Agreement and that the significance level for detected differences is higher for the former than for the latter. (C) 2007 Elsevier Ltd. All rights reserved.
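The contrast the abstract draws between the two measures can be illustrated with a minimal sketch. It assumes that both measures reduce to a simple match rate over binary relevance judgments. The abstract does not give the exact formula, so the function name, the data, and the scoring rule below are hypothetical illustrations, not the authors' implementation.

```python
from typing import Sequence


def agreement_rate(summary_judgments: Sequence[bool],
                   reference_judgments: Sequence[bool]) -> float:
    """Fraction of items where the summary-based judgment matches the reference.

    Assumed scoring rule (simple match rate); not taken from the paper.
    """
    if len(summary_judgments) != len(reference_judgments):
        raise ValueError("judgment lists must have the same length")
    if not summary_judgments:
        raise ValueError("at least one judgment is required")
    matches = sum(s == r for s, r in zip(summary_judgments, reference_judgments))
    return matches / len(summary_judgments)


# Hypothetical judgments for five documents (True = judged relevant).
summary_based = [True, False, True, True, False]       # user reads only the summary
same_user_full_doc = [True, False, False, True, False] # same user reads the full document
gold_standard = [True, True, False, True, True]        # independent gold-standard annotation

# Relevance Prediction style: compare against the same user's own full-document judgment.
relevance_prediction = agreement_rate(summary_based, same_user_full_doc)

# LDC Agreement style: compare against an external gold standard.
ldc_agreement = agreement_rate(summary_based, gold_standard)

print(f"Relevance Prediction (sketch): {relevance_prediction:.2f}")
print(f"LDC Agreement (sketch): {ldc_agreement:.2f}")
```

The only difference between the two calls is the reference list: the same user's own judgments in the first case, an independent annotator's in the second, which is the distinction the abstract emphasizes.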
Pages: 1482-1499
Number of pages: 18