DANEWSROOM: A Large-scale Danish Summarisation Dataset

Citations: 0
Authors
Varab, Daniel [1]
Schluter, Natalie [1]
Affiliations
[1] IT Univ Copenhagen, Copenhagen, Denmark
Source
PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020) | 2020
Keywords
automatic text summarisation; data collection; Danish corpus
DOI
Not available
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Dataset development for automatic summarisation systems is notoriously English-oriented. In this paper we present the first large-scale non-English dataset specifically curated for automatic summarisation. The document-summary pairs are news articles and manually written summaries in Danish. No previous work has established a Danish summarisation dataset, and there is no published work on the automatic summarisation of Danish. We therefore provide the first automatic summarisation dataset for the Danish language (large-scale or otherwise). To support the comparison of future automatic summarisation systems for Danish, we report the performance on this dataset of strong, well-established unsupervised baseline systems, together with an oracle extractive summariser; this is the first account of automatic summarisation system performance for Danish. Finally, we make all code for automatically acquiring the data freely available and make explicit how this technology can easily be adapted to acquire automatic summarisation datasets for further languages.
Pages: 6731-6739
Page count: 9