KP-Rank: a semantic-based unsupervised approach for keyphrase extraction from text data

Cited by: 11
Authors
Aman, Muhammad [1, 2]
Abdulkadir, Said Jadid [3 ]
Aziz, Izzatdin Abdul [3 ]
Alhussian, Hitham [3 ]
Ullah, Israr [4 ]
Affiliations
[1] Univ Teknol Petronas, Dept Comp & Informat Sci, Seri Iskandar, Perak, Malaysia
[2] Natl Database & Registrat Author NADRA, Technol & Dev Directorate, Islamabad, Pakistan
[3] Univ Teknol Petronas, Dept Comp & Informat Sci, Ctr Res Data Sci CeRDaS, Seri Iskandar, Perak, Malaysia
[4] Virtual Univ Pakistan, Dept Comp Sci, Lahore, Pakistan
Keywords
Keyphrase extraction; Key concept extraction; Information retrieval; Information extraction; Text mining; Frequency; LSA
DOI
10.1007/s11042-020-10215-x
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Automatic key concept identification from text is a major challenge in information extraction, information retrieval, digital libraries, ontology learning, and text analysis. The main difficulty lies in the text data itself: noise, diversity, scale, context dependency, and word sense ambiguity. To cope with this challenge, numerous supervised and unsupervised approaches have been devised. The existing topical clustering-based approaches for keyphrase extraction are domain dependent and overlook semantic similarity between candidate features when extracting topical phrases. This paper proposes a semantic-based unsupervised approach (KP-Rank) for keyphrase extraction. The approach combines Latent Semantic Analysis (LSA) with clustering techniques and introduces a novel frequency-based algorithm for candidate ranking that considers locality-based sentence, paragraph, and section frequencies. To evaluate the proposed method, three benchmark datasets from different domains are used: Inspec, 500N-KPCrowd, and SemEval-2010. The experimental results show that, overall, KP-Rank achieves significant improvements over existing approaches on the selected performance measures.
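The abstract does not give the ranking formula, so the following is only a minimal sketch of what "locality-based sentence, paragraph and section frequencies" could look like in practice. It counts how many sentences, paragraphs, and sections contain each candidate phrase and combines the counts with a weighted sum. The function names `locality_frequencies` and `rank`, the naive sentence splitter, the input layout, and the equal weights are all assumptions made for illustration; they are not taken from the KP-Rank paper, and the LSA and clustering steps are omitted.

```python
import re
from collections import Counter


def locality_frequencies(sections, candidates):
    """Count how many sentences, paragraphs, and sections contain each candidate.

    sections:   list of sections, each a list of paragraph strings (assumed layout).
    candidates: iterable of lowercase candidate phrases.
    """
    counts = {c: Counter() for c in candidates}
    for section in sections:
        seen_in_section = set()
        for paragraph in section:
            seen_in_paragraph = set()
            # Naive sentence split: break after ., ! or ? followed by whitespace.
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
                text = sentence.lower()
                for c in counts:
                    # Whole-phrase match on word boundaries.
                    if re.search(r"\b" + re.escape(c) + r"\b", text):
                        counts[c]["sentence"] += 1
                        seen_in_paragraph.add(c)
            for c in seen_in_paragraph:
                counts[c]["paragraph"] += 1
            seen_in_section |= seen_in_paragraph
        for c in seen_in_section:
            counts[c]["section"] += 1
    return counts


def rank(counts, weights=(1.0, 1.0, 1.0)):
    """Hypothetical score: weighted sum of the three locality frequencies."""
    w_sent, w_par, w_sec = weights
    scores = {
        c: w_sent * f["sentence"] + w_par * f["paragraph"] + w_sec * f["section"]
        for c, f in counts.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy document: two sections, the first with two paragraphs.
sections = [
    [
        "Keyphrase extraction is hard. Keyphrase extraction needs context.",
        "Latent semantic analysis helps.",
    ],
    ["We rank by keyphrase extraction frequency."],
]
candidates = ["keyphrase extraction", "latent semantic analysis"]

counts = locality_frequencies(sections, candidates)
ranking = rank(counts)
print(ranking)
```

A phrase that recurs across several paragraphs and sections scores higher than one confined to a single sentence, which is the intuition behind using locality rather than raw term frequency alone. The weights would need tuning against a benchmark before any such score could be trusted.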
Pages: 12469-12506
Page count: 38