OCCAMS - An Optimal Combinatorial Covering Algorithm for Multi-document Summarization

Cited by: 20
Authors
Davis, Sashka T. [1]
Conroy, John M. [1]
Schlesinger, Judith D. [1]
Affiliations
[1] IDA Ctr Comp Sci, Bowie, MD USA
Source
12TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2012) | 2012
DOI
10.1109/ICDMW.2012.50
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
OCCAMS is a new algorithm for the Multi-Document Summarization (MDS) problem. We use Latent Semantic Analysis (LSA) to produce term weights which identify the main theme(s) of a set of documents. These are used by our heuristic for extractive sentence selection, which borrows techniques from combinatorial optimization to select a set of sentences such that the combined weight of the terms covered is maximized while redundancy is minimized. OCCAMS outperforms CLASSY11 on DUC/TAC data for nearly all years since 2005, where CLASSY11 is the best human-rated system of TAC 2011. OCCAMS also delivers higher ROUGE scores than all human-generated summaries for TAC 2011. We show that if the combinatorial component of OCCAMS, which computes the extractive summary, is given true weights of terms, then the quality of the summaries generated outperforms all human-generated summaries for all years using ROUGE-2, ROUGE-SU4, and a coverage metric. We introduce this new metric based on term coverage and demonstrate that a simple bi-gram instantiation achieves a statistically significantly higher Pearson correlation with overall responsiveness than ROUGE on the TAC data.
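The sentence-selection idea in the abstract (maximize the total weight of covered terms within a length limit, counting each term only once) can be illustrated with a small sketch. This is the textbook greedy heuristic for budgeted maximum coverage, not the OCCAMS procedure itself, which the abstract does not spell out. The function name `greedy_cover`, the toy term weights (hand-picked rather than LSA-derived), and the word budget are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def greedy_cover(sentences: List[Set[str]], lengths: List[int],
                 weights: Dict[str, float], budget: int) -> List[int]:
    """Pick sentence indices greedily by newly covered term weight per word.

    Terms already covered add no gain, so redundant sentences are
    penalized implicitly. Lengths are assumed to be positive.
    """
    covered: Set[str] = set()
    chosen: List[int] = []
    used = 0
    remaining = set(range(len(sentences)))
    while remaining:
        best, best_ratio = None, 0.0
        for i in sorted(remaining):
            if used + lengths[i] > budget:
                continue  # sentence does not fit in the remaining budget
            gain = sum(weights.get(t, 0.0) for t in sentences[i] - covered)
            ratio = gain / lengths[i]
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break  # nothing fits, or nothing adds new weight
        chosen.append(best)
        covered |= sentences[best]
        used += lengths[best]
        remaining.discard(best)
    return chosen


# Hypothetical toy input: term sets per sentence, sentence lengths in words.
weights = {"flood": 3.0, "river": 2.0, "rescue": 2.5, "weather": 0.5}
sentences = [{"flood", "river"}, {"flood", "rescue"},
             {"weather"}, {"river", "rescue", "flood"}]
lengths = [4, 5, 2, 9]
summary = greedy_cover(sentences, lengths, weights, budget=10)
```

On this toy input the sketch selects sentences 0 and 1, which together cover "flood", "river", and "rescue" in 9 of the 10 available words. A greedy ratio rule like this carries no optimality guarantee in general; it only shows how term weights and a length limit interact.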
Pages: 454-463
Page count: 10