CATAMARAN: A Cross-lingual Long Text Abstractive Summarization Dataset

Cited by: 0

Authors:

Chen, Zheng [1]

Lin, Hongyu [1]

Affiliations:

[1] University of Electronic Science and Technology of China, No. 4, Section 2, North Jianshe Road, Chengdu, People's Republic of China

Source:

LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022

Keywords:

Abstractive summarization; Cross-lingual summarization; Long text summarization;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Subject classification codes:

081203 ; 0835 ;

Abstract:

Cross-lingual summarization, which produces the summary in one language from a given source document in another language, could be extremely helpful for humans to obtain information across the world. However, it is still a little-explored task due to the lack of datasets. Recent studies are primarily based on pseudo-cross-lingual datasets obtained by translation. Such an approach would inevitably lead to the loss of information in the original document and introduce noise into the summary, thus hurting the overall performance. In this paper, we present CATAMARAN, the first high-quality cross-lingual long text abstractive summarization dataset. It contains about 20,000 parallel news articles and corresponding summaries, all written by humans. The average lengths of articles are 1133.65 for English articles and 2035.33 for Chinese articles, and the average lengths of the summaries are 26.59 and 70.05, respectively. We train and evaluate an mBART-based cross-lingual abstractive summarization model using our dataset. The result shows that, compared with mono-lingual systems, the cross-lingual abstractive summarization system could also achieve solid performance.

Pages: 6932-6937

Page count: 6
