MEDIASUM: A Large-scale Media Interview Dataset for Dialogue Summarization

Citations: 0
Authors
Zhu, Chenguang [1]
Liu, Yang [1]
Mei, Jie [1]
Zeng, Michael [1]
Affiliations
[1] Microsoft Cognit Serv Res Grp, Redmond, WA USA
Source
2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021) | 2021
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Subject classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper introduces MEDIASUM, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries. To create this dataset, we collect interview transcripts from NPR and CNN and employ the overview and topic descriptions as summaries. Compared with existing public corpora for dialogue summarization, our dataset is an order of magnitude larger and contains complex multi-party conversations from multiple domains. We conduct statistical analysis to demonstrate the unique positional bias exhibited in the transcripts of televised and radioed interviews. We also show that MEDIASUM can be used in transfer learning to improve a model's performance on other dialogue summarization tasks.
Pages: 5927-5934
Page count: 8