A new dataset for French and multilingual keyphrase generation

Cited by: 0
Authors
Piedboeuf, Frederic [1]
Langlais, Philippe [1]
Affiliations
[1] Univ Montreal, RALI, Diro, Montreal, PQ, Canada
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022 | 2022
Keywords
EXTRACTION;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Keyphrases are key components in efficiently dealing with the ever-increasing amount of information present on the internet. While there are many recent papers on English keyphrase generation, keyphrase generation for other languages remains vastly understudied, mostly due to the absence of datasets. To address this, we present a novel dataset called Papyrus, composed of 16427 pairs of abstracts and keyphrases. We release four versions of this dataset, corresponding to different subtasks. Papyrus-e considers only English keyphrases, Papyrus-f considers French keyphrases, Papyrus-m considers keyphrase generation in any language (mostly French and English), and Papyrus-a considers keyphrase generation in several languages. We train a state-of-the-art model on all four tasks and show that they lead to better results for non-English languages, with an average improvement of 14.2% on keyphrase extraction and 2.0% on generation. We also show an improvement of 0.4% on extraction and 0.7% on generation over English state-of-the-art results by concatenating Papyrus-e with the Kp20K training set.
Pages: 14