CATAMARAN: A Cross-lingual Long Text Abstractive Summarization Dataset

Cited by: 0
Authors
Chen, Zheng [1]
Lin, Hongyu [1]
Affiliations
[1] Univ Elect Sci & Technol China, 4, Sect 2, North Jianshe Rd, Chengdu, Peoples R China
Source
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022
Keywords
Abstractive summarization; Cross-lingual summarization; Long text summarization
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Cross-lingual summarization, which produces the summary in one language from a given source document in another language, could be extremely helpful for humans to obtain information across the world. However, it is still a little-explored task due to the lack of datasets. Recent studies are primarily based on pseudo-cross-lingual datasets obtained by translation. Such an approach would inevitably lead to the loss of information in the original document and introduce noise into the summary, thus hurting the overall performance. In this paper, we present CATAMARAN, the first high-quality cross-lingual long text abstractive summarization dataset. It contains about 20,000 parallel news articles and corresponding summaries, all written by humans. The average lengths of articles are 1133.65 for English articles and 2035.33 for Chinese articles, and the average lengths of the summaries are 26.59 and 70.05, respectively. We train and evaluate an mBART-based cross-lingual abstractive summarization model using our dataset. The result shows that, compared with mono-lingual systems, the cross-lingual abstractive summarization system could also achieve solid performance.
Pages: 6932-6937
Number of pages: 6
Related papers
50 in total
  • [1] WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization
    Ladhak, Faisal
    Durmus, Esin
    Cardie, Claire
    McKeown, Kathleen
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 4034 - 4048
  • [2] A Robust Abstractive System for Cross-Lingual Summarization
    Ouyang, Jessica
    Song, Boya
    McKeown, Kathleen
    2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 2025 - 2031
  • [3] Multi-Task Learning for Cross-Lingual Abstractive Summarization
    Takase, Sho
    Okazaki, Naoaki
    2022 Language Resources and Evaluation Conference, LREC 2022, 2022, : 3008 - 3016
  • [6] Dataset construction method of cross-lingual summarization based on filtering and text augmentation
    Pan, Hangyu
    Xi, Yaoyi
    Wang, Ling
    Nan, Yu
    Su, Zhizhong
    Cao, Rong
    PEERJ COMPUTER SCIENCE, 2023, 9
  • [7] Cross-Lingual Speech-to-Text Summarization
    Pontes, Elvys Linhares
    Gonzalez-Gallardo, Carlos-Emiliano
    Torres-Moreno, Juan-Manuel
    Huet, Stephane
    MULTIMEDIA AND NETWORK INFORMATION SYSTEMS, 2019, 833 : 385 - 395
  • [8] Cross-Lingual Korean Speech-to-Text Summarization
    Yoon, HyoJeon
    Dinh Tuyen Hoang
    Ngoc Thanh Nguyen
    Hwang, Dosam
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2019, PT I, 2019, 11431 : 198 - 206
  • [9] Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation
    Zhang, Ran
    Ouni, Jihed
    Eger, Steffen
    COMPUTATIONAL LINGUISTICS, 2024, 50 (03) : 1001 - 1047
  • [10] Cross-lingual timeline summarization
    Cagliero, Luca
    La Quatra, Moreno
    Garza, Paolo
    Baralis, Elena
    2021 IEEE FOURTH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND KNOWLEDGE ENGINEERING (AIKE 2021), 2021, : 45 - 53