CATAMARAN: A Cross-lingual Long Text Abstractive Summarization Dataset

Authors
Chen, Zheng [1]
Lin, Hongyu [1]
Affiliations
[1] University of Electronic Science and Technology of China, No. 4, Section 2, North Jianshe Road, Chengdu, People's Republic of China
Source
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022
Keywords
Abstractive summarization; Cross-lingual summarization; Long text summarization
DOI
Not available
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Cross-lingual summarization, which produces a summary in one language from a source document in another language, could be extremely helpful for obtaining information from around the world. However, the task remains little explored because of a lack of datasets. Recent studies rely primarily on pseudo-cross-lingual datasets obtained by translation. This approach inevitably loses information from the original document and introduces noise into the summary, which hurts overall performance. In this paper, we present CATAMARAN, the first high-quality cross-lingual long text abstractive summarization dataset. It contains about 20,000 parallel news articles and corresponding summaries, all written by humans. The average article length is 1133.65 for English and 2035.33 for Chinese, and the average summary length is 26.59 and 70.05, respectively. We train and evaluate an mBART-based cross-lingual abstractive summarization model on our dataset. The results show that, compared with mono-lingual systems, the cross-lingual abstractive summarization system can also achieve solid performance.
Pages: 6932-6937
Page count: 6