SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-Trained Language Model

Times cited: 85
Authors
Sun, Yi [1]
Qiu, Hangping [1]
Zheng, Yu [2]
Wang, Zhongwei [1]
Zhang, Chaoran [1]
Affiliations
[1] Army Engn Univ PLA, Command & Control Engn Coll, Nanjing 210001, Peoples R China
[2] MIIT, Res Inst 5, Ceprei Nanjing Lab, Nanjing 211800, Peoples R China
Keywords
Keyphrase extraction; pre-trained language model; sentence embeddings; position-biased weight; SIFRank
DOI
10.1109/ACCESS.2020.2965087
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In the age of social media, with its huge volume of knowledge and information, accurate and efficient keyphrase extraction methods are needed for information retrieval and natural language processing. Traditional keyphrase extraction models have difficulty incorporating large amounts of external knowledge, but the rise of pre-trained language models offers a new way to solve this problem. Against this background, we propose SIFRank, a new baseline for unsupervised keyphrase extraction based on a pre-trained language model. SIFRank combines the sentence embedding model SIF with the autoregressive pre-trained language model ELMo, and it achieves the best performance in keyphrase extraction for short documents. We speed up SIFRank while maintaining its accuracy through document segmentation and alignment of contextual word embeddings. For long documents, we upgrade SIFRank to SIFRank+ with a position-biased weight, which greatly improves its performance on long documents. Compared to other baseline models, our model achieves state-of-the-art performance on three widely used datasets.
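The combination of SIF weighting and contextual word vectors described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not in the abstract: toy 2-dimensional vectors stand in for ELMo embeddings, the SIF weight a / (a + p(w)) is taken from Arora et al. (2017) with the common-component removal step omitted, and candidates are assumed to be scored by cosine similarity between the candidate embedding and the document embedding. All function names (`sif_weight`, `sif_embedding`, `rank_candidates`) are hypothetical.

```python
import math


def sif_weight(word_prob, a=1e-3):
    """SIF weight a / (a + p(w)): frequent words get smaller weights."""
    return a / (a + word_prob)


def sif_embedding(tokens, vectors, word_probs, a=1e-3):
    """SIF-weighted average of token vectors (no common-component removal)."""
    if not vectors:
        return []
    dim = len(vectors[0])
    total = [0.0] * dim
    for tok, vec in zip(tokens, vectors):
        # Words missing from word_probs are treated as probability 0 (weight 1).
        w = sif_weight(word_probs.get(tok, 0.0), a)
        for i in range(dim):
            total[i] += w * vec[i]
    return [x / len(tokens) for x in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_candidates(doc_tokens, doc_vectors, candidates, word_probs, a=1e-3):
    """Rank (phrase, start, end) token spans by similarity to the document.

    Assumption: the score is the cosine similarity between the candidate's
    SIF embedding and the whole document's SIF embedding.
    """
    doc_emb = sif_embedding(doc_tokens, doc_vectors, word_probs, a)
    scored = []
    for phrase, start, end in candidates:
        emb = sif_embedding(doc_tokens[start:end], doc_vectors[start:end],
                            word_probs, a)
        scored.append((phrase, cosine(emb, doc_emb)))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In a real pipeline the toy vectors would be replaced by contextual embeddings from a pre-trained model, and the word probabilities would be estimated from a large corpus. The position-biased weight used by SIFRank+ is not modelled here.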
Pages: 10896-10906
Page count: 11