CATAMARAN: A Cross-lingual Long Text Abstractive Summarization Dataset

Cited by: 0
Authors
Chen, Zheng [1]
Lin, Hongyu [1]
Affiliations
[1] Univ Elect Sci & Technol China, 4, Sect 2, North Jianshe Rd, Chengdu, Peoples R China
Source
LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022
Keywords
Abstractive summarization; Cross-lingual summarization; Long text summarization
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Cross-lingual summarization, which produces the summary in one language from a given source document in another language, could be extremely helpful for humans to obtain information across the world. However, it is still a little-explored task due to the lack of datasets. Recent studies are primarily based on pseudo-cross-lingual datasets obtained by translation. Such an approach would inevitably lead to the loss of information in the original document and introduce noise into the summary, thus hurting the overall performance. In this paper, we present CATAMARAN, the first high-quality cross-lingual long text abstractive summarization dataset. It contains about 20,000 parallel news articles and corresponding summaries, all written by humans. The average lengths of articles are 1133.65 for English articles and 2035.33 for Chinese articles, and the average lengths of the summaries are 26.59 and 70.05, respectively. We train and evaluate an mBART-based cross-lingual abstractive summarization model using our dataset. The result shows that, compared with mono-lingual systems, the cross-lingual abstractive summarization system could also achieve solid performance.
Pages: 6932-6937
Number of pages: 6