Multi-document summarization through subsection-aware pre-training objectives

Times Cited: 0
Authors
Wang, Xianchuan [1 ,2 ]
Lu, Ben [1 ]
Ming, Wenkai [1 ]
Wang, Xianchao [1 ,2 ]
Affiliations
[1] Fuyang Normal Univ, Sch Comp & Informat Engn, Fuyang, Anhui, Peoples R China
[2] Fuyang Normal Univ, Anhui Engn Res Ctr Intelligent Comp & Informat Inn, Fuyang, Anhui, Peoples R China
Keywords
Multi-document summarization; Pre-training objective; Abstractive summarization; Large language model; Document dependency;
DOI
10.1007/s11227-025-07504-3
CLC Classification Number
TP3 [Computing Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Pre-trained language models are increasingly applied to multi-document summarization (MDS). However, pre-training typically requires large-scale, domain-specific data, and most MDS pre-training models treat multiple documents as a single concatenated document, ignoring the subsection-level relationships among them. In this paper, we focus on pre-training objectives for MDS, under the assumption that key information recurs across multiple documents on the same topic. We segment each document into subsections based on text-structure features, compare the subsections of different documents, extract key sentences from these subsections using text similarity, and generate a proxy summary. Experimental results on Multi-News and WikiSum demonstrate that the proposed model outperforms the compared MDS models in terms of ROUGE scores and maintains strong performance even with limited data samples.
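A minimal sketch of the proxy-summary construction outlined in the abstract, assuming blank-line segmentation and TF-IDF cosine similarity; the abstract does not specify the paper's text-structure features or similarity measure, so split_subsections, proxy_summary, and the top_k parameter are all illustrative names, not the authors' implementation.

# Hypothetical sketch: score each sentence by how strongly it recurs in
# OTHER documents of the cluster, then keep the top-k sentences as the
# proxy summary used as a pre-training target.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def split_subsections(document: str) -> list[str]:
    # Assumption: subsections are separated by blank lines; the paper
    # segments on richer text-structure features.
    return [s.strip() for s in document.split("\n\n") if s.strip()]


def proxy_summary(documents: list[str], top_k: int = 5) -> str:
    # Collect every sentence along with the index of its source document.
    sentences, doc_ids = [], []
    for doc_id, doc in enumerate(documents):
        for subsection in split_subsections(doc):
            for sent in subsection.split(". "):  # crude sentence split
                if sent.strip():
                    sentences.append(sent.strip())
                    doc_ids.append(doc_id)

    # Pairwise TF-IDF cosine similarity over all sentences.
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))

    # A sentence's score is its maximum similarity to any sentence from a
    # *different* document: key information is assumed to recur across
    # documents on the same topic.
    scores = []
    for i in range(len(sentences)):
        cross = [sim[i, j] for j in range(len(sentences)) if doc_ids[j] != doc_ids[i]]
        scores.append(max(cross) if cross else 0.0)

    # Keep the top-k cross-document sentences, restored to original order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k]
    return " ".join(sentences[i] for i in sorted(ranked))

For a cluster such as proxy_summary([doc_a, doc_b, doc_c]), this returns the sentences most strongly echoed across the documents, which is the kind of proxy target the abstract describes for subsection-aware pre-training.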
Pages: 20