CATAMARAN: A Cross-lingual Long Text Abstractive Summarization Dataset

Cited by: 0

Authors:

Chen, Zheng ^{[1]}

Lin, Hongyu ^{[1]}

Affiliations:

[1] University of Electronic Science and Technology of China, 4, Section 2, North Jianshe Road, Chengdu, People's Republic of China

Source:

LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022

Keywords:

Abstractive summarization; Cross-lingual summarization; Long text summarization;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Cross-lingual summarization, which produces the summary in one language from a given source document in another language, could be extremely helpful for humans to obtain information across the world. However, it is still a little-explored task due to the lack of datasets. Recent studies are primarily based on pseudo-cross-lingual datasets obtained by translation. Such an approach would inevitably lead to the loss of information in the original document and introduce noise into the summary, thus hurting the overall performance. In this paper, we present CATAMARAN, the first high-quality cross-lingual long text abstractive summarization dataset. It contains about 20,000 parallel news articles and corresponding summaries, all written by humans. The average lengths of articles are 1133.65 for English articles and 2035.33 for Chinese articles, and the average lengths of the summaries are 26.59 and 70.05, respectively. We train and evaluate an mBART-based cross-lingual abstractive summarization model using our dataset. The result shows that, compared with mono-lingual systems, the cross-lingual abstractive summarization system could also achieve solid performance.

Pages: 6932-6937

Page count: 6
