CoMSum and SIBERT: A Dataset and Neural Model for Query-Based Multi-document Summarization

Cited by: 6

Authors:

Kulkarni, Sayali ^{[1]}

Chammas, Sheide ^{[1]}

Zhu, Wan ^{[1]}

Sha, Fei ^{[1]}

Ie, Eugene ^{[1]}

Affiliations:

[1] Google Res, Mountain View, CA 94043 USA

Source:

DOCUMENT ANALYSIS AND RECOGNITION - ICDAR 2021, PT II | 2021 / Vol. 12822

Keywords:

Extractive summarization; Abstractive summarization; Neural models; Transformers; Summarization dataset;

DOI:

10.1007/978-3-030-86331-9_6

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Document summarization compresses source document(s) into succinct and information-preserving text. A variant of this is query-based multi-document summarization (qMDS), which targets summaries to specific informational needs, contextualized to the query. However, progress in this area is hindered by the limited availability of large-scale datasets. In this work, we make two contributions. First, we propose an approach for automatically generating a dataset for both extractive and abstractive summaries, and we release a version publicly. Second, we design a neural model, SIBERT, for extractive summarization that exploits the hierarchical nature of the input. It also infuses queries to extract query-specific summaries. We evaluate this model on the CoMSum dataset and show a significant improvement in performance. This should provide a baseline and enable the use of CoMSum for future research on qMDS.


Pages: 84-98

Page count: 15
