CoMSum and SIBERT: A Dataset and Neural Model for Query-Based Multi-document Summarization

被引:6
作者
Kulkarni, Sayali [1 ]
Chammas, Sheide [1 ]
Zhu, Wan [1 ]
Sha, Fei [1 ]
Ie, Eugene [1 ]
机构
[1] Google Res, Mountain View, CA 94043 USA
来源
DOCUMENT ANALYSIS AND RECOGNITION - ICDAR 2021, PT II | 2021年 / 12822卷
关键词
Extractive summarization; Abstractive summarization; Neural models; Transformers; Summarization dataset;
D O I
10.1007/978-3-030-86331-9_6
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Document summarization compress source document (s) into succinct and information-preserving text. A variant of this is query-based multi-document summarization (qmps) that targets summaries to providing specific informational needs, contextualized to the query. However, the progress in this is hindered by limited availability to large-scale datasets. In this work, we make two contributions. First, we propose an approach for automatically generated dataset for both extractive and abstractive summaries and release a version publicly. Second, we design a neural model SIBERT for extractive summarization that exploits the hierarchical nature of the input. It also infuses queries to extract query-specific summaries. We evaluate this model on CoMSum dataset showing significant improvement in performance. This should provide a baseline and enable using CoMSum for future research on qMDS.
引用
收藏
页码:84 / 98
页数:15
相关论文
共 44 条
[1]  
[Anonymous], 2004, NIPS
[2]  
Bajaj P., 2018, Ms marco: A human generated machine reading comprehension dataset
[3]  
Baumel Tal, 2018, CoRR abs/1801.07704
[4]  
Beltagy I., 2020, Longformer: The Long-Document Transformer, V2004, P05150, DOI DOI 10.48550/ARXIV.2004.05150
[5]  
Cer D, 2018, CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018): PROCEEDINGS OF SYSTEM DEMONSTRATIONS, P169
[6]  
Dang H.T., 2006, P WORKSH TASK FOC SU, P48
[7]  
Daumé H, 2006, COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, P305
[8]  
Deng Y, 2020, AAAI CONF ARTIF INTE, V34, P7651
[9]  
Diego Antognini B.F., 2020, LREC
[10]  
Dunn Matthew, 2017, Searchqa: A new q&a dataset augmented with context from a search engine