CATAMARAN: A Cross-lingual Long Text Abstractive Summarization Dataset

Cited by: 0

Authors:

Chen, Zheng [1]

Lin, Hongyu [1]

Affiliations:

[1] Univ Elect Sci & Technol China, 4, Sect 2, North Jianshe Rd, Chengdu, Peoples R China

Source:

LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022

Keywords:

Abstractive summarization; Cross-lingual summarization; Long text summarization

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification codes:

081203; 0835

Abstract:

Cross-lingual summarization, which produces the summary in one language from a given source document in another language, could be extremely helpful for humans to obtain information across the world. However, it is still a little-explored task due to the lack of datasets. Recent studies are primarily based on pseudo-cross-lingual datasets obtained by translation. Such an approach would inevitably lead to the loss of information in the original document and introduce noise into the summary, thus hurting the overall performance. In this paper, we present CATAMARAN, the first high-quality cross-lingual long text abstractive summarization dataset. It contains about 20,000 parallel news articles and corresponding summaries, all written by humans. The average lengths of articles are 1133.65 for English articles and 2035.33 for Chinese articles, and the average lengths of the summaries are 26.59 and 70.05, respectively. We train and evaluate an mBART-based cross-lingual abstractive summarization model using our dataset. The result shows that, compared with mono-lingual systems, the cross-lingual abstractive summarization system could also achieve solid performance.

Pages: 6932-6937

Page count: 6
