DACSA: A large-scale Dataset for Automatic summarization of Catalan and Spanish newspaper Articles

Cited by: 0
Authors
Segarra, Encarna [1]
Ahuir, Vicent [1]
Hurtado, Lluis-F [1]
Gonzalez, Jose Angel [1]
Affiliations
[1] Universitat Politècnica de València, VRAIN Valencian Research Institute for Artificial Intelligence, Valencia, Spain
Source
NAACL 2022: The 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies | 2022
Keywords
None listed
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The application of supervised methods to automatic summarization requires the availability of adequate corpora consisting of a set of document-summary pairs. As in most Natural Language Processing tasks, the great majority of available datasets for summarization are in English, making it difficult to develop automatic summarization models for other languages. Although Spanish is gradually forming part of some recent summarization corpora, it is not the same for minority languages such as Catalan. In this work, we describe the construction of a corpus of Catalan and Spanish newspapers, the Dataset for Automatic summarization of Catalan and Spanish newspaper Articles (DACSA) corpus. It is a high-quality large-scale corpus that can be used to train summarization models for Catalan and Spanish. We have carried out an analysis of the corpus, both in terms of the style of the summaries and the difficulty of the summarization task. In particular, we have used a set of well-known metrics in the summarization field in order to characterize the corpus. Additionally, we have evaluated the performance of some extractive and abstractive summarization systems on the DACSA corpus for benchmarking purposes.
Pages: 5931-5943
Page count: 13
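
The abstract mentions characterizing the corpus with well-known summarization metrics. The sketch below is a minimal, hedged illustration of one common way such corpus statistics are computed, assuming Newsroom-style extractive fragment coverage, extractive fragment density, and compression ratio (in the spirit of Grusky et al., 2018) with simple whitespace tokenization; it is not the authors' actual pipeline, only an example of per-pair statistics that can be averaged over a corpus.

```python
# Hedged sketch: Newsroom-style statistics often used to characterize
# summarization corpora. Whitespace tokenization is a simplifying assumption,
# not necessarily the preprocessing used for DACSA.
from typing import List


def extractive_fragments(article: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily match the longest shared token fragments of the summary in the article."""
    fragments = []
    i = 0
    while i < len(summary):
        best: List[str] = []
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                # Extend the match as far as both sequences agree.
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                if k > len(best):
                    best = summary[i:i + k]
                j += k
            else:
                j += 1
        if best:
            fragments.append(best)
            i += len(best)
        else:
            i += 1
    return fragments


def pair_statistics(article_text: str, summary_text: str) -> dict:
    """Compression ratio, extractive fragment coverage and density for one pair."""
    article, summary = article_text.split(), summary_text.split()
    frags = extractive_fragments(article, summary)
    coverage = sum(len(f) for f in frags) / len(summary)
    density = sum(len(f) ** 2 for f in frags) / len(summary)
    compression = len(article) / len(summary)
    return {"coverage": coverage, "density": density, "compression": compression}


if __name__ == "__main__":
    doc = "the city council approved the new budget on monday after a long debate"
    summ = "the city council approved the new budget"
    print(pair_statistics(doc, summ))
```

Higher coverage and density suggest more extractive reference summaries, while compression indicates how strongly articles are condensed; averaging these values over all document-summary pairs gives the kind of corpus-level profile the abstract describes.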