Efficient sequential pattern mining with wildcards for keyphrase extraction

Cited by: 48
Authors
Xie, Fei [1 ,2 ]
Wu, Xindong [1 ,3 ]
Zhu, Xingquan [4 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
[2] Hefei Normal Univ, Dept Comp Sci & Technol, Hefei 230601, Peoples R China
[3] Univ Louisiana Lafayette, Sch Comp & Informat, Lafayette, LA 70503 USA
[4] Florida Atlantic Univ, Dept Comp & Elect Engn & Comp Sci, Boca Raton, FL 33431 USA
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
Document summarization; Keyphrase extraction; Sequential pattern mining; Wildcards; Classification; SYSTEM; RECOMMENDATION;
DOI
10.1016/j.knosys.2016.10.011
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A keyphrase (a multi-word unit) in a document denotes one or multiple keywords capturing a main topic of the underlying document. Finding good keyphrases of a document can quickly summarize knowledge for efficient decision making and benefit domains involving intensive text information. To date, existing keyphrase extraction methods cannot be customized to each specific document, mainly because the patterns they use to form keyphrases are too restrictive and may not capture flexible keyword relationships inside the text. In this paper, we propose a sequential pattern mining based document-specific keyphrase extraction method. Our key innovation is to use wildcards (or gap constraints) to help extract sequential patterns, so that the flexible wildcard constraints within a pattern can capture semantic relationships between words, and the system has full flexibility to discover different types of sequential patterns as candidates for keyphrase extraction. To achieve this goal, we regard each single document as a sequential dataset, and propose an efficient algorithm to mine sequential patterns with wildcard and one-off conditions that allows important keyphrases to be captured during the mining process. For each extracted keyphrase candidate, we use some statistical pattern features to characterize it, and further collect all keyphrases from the document to form a training set. A supervised learning classifier is trained to identify keyphrases from a test document. Because our pattern mining and pattern characterization processes are customized to each single document, keyphrases extracted by our method are highly specific to each document. Experimental results demonstrate that the proposed sequential pattern mining method outperforms existing pattern mining methods in both runtime performance and completeness. Comparisons on keyphrase benchmark datasets also confirm that the proposed document-specific keyphrase extraction method is effective in improving the quality of extracted keyphrases. (C) 2016 Elsevier B.V. All rights reserved.
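The abstract names wildcard (gap) constraints and the one-off condition but does not spell out the algorithm. The Python sketch below is an illustration only and is not the authors' method. It assumes that a gap constraint bounds how many words may sit between consecutive pattern words, and that the one-off condition means each word position in the document may be used by at most one occurrence. The names `find_occurrence`, `one_off_occurrences`, `min_gap`, and `max_gap` are hypothetical. The greedy left-to-right scan returns a valid set of occurrences, which may be smaller than the largest possible set.

```python
from typing import List, Optional, Sequence


def find_occurrence(
    words: Sequence[str],
    pattern: Sequence[str],
    used: List[bool],
    start: int,
    min_gap: int,
    max_gap: int,
) -> Optional[List[int]]:
    """Return positions of one occurrence of `pattern` that begins at `start`.

    Between two consecutive pattern words there must be between `min_gap`
    and `max_gap` skipped positions (the wildcards). Positions already
    marked in `used` are not available (assumed one-off condition).
    """
    if used[start] or words[start] != pattern[0]:
        return None

    def extend(positions: List[int]) -> Optional[List[int]]:
        k = len(positions)
        if k == len(pattern):
            return positions
        last = positions[-1]
        lo = last + 1 + min_gap
        hi = min(last + 1 + max_gap, len(words) - 1)
        for pos in range(lo, hi + 1):
            if not used[pos] and words[pos] == pattern[k]:
                found = extend(positions + [pos])
                if found is not None:
                    return found
        return None  # backtrack: no valid continuation from here

    return extend([start])


def one_off_occurrences(
    words: Sequence[str],
    pattern: Sequence[str],
    min_gap: int = 0,
    max_gap: int = 2,
) -> List[List[int]]:
    """Greedily collect occurrences that share no word position."""
    if not pattern:
        return []
    used = [False] * len(words)
    occurrences: List[List[int]] = []
    for start in range(len(words)):
        occ = find_occurrence(words, pattern, used, start, min_gap, max_gap)
        if occ is not None:
            for pos in occ:
                used[pos] = True  # each position serves one occurrence only
            occurrences.append(occ)
    return occurrences


doc = "sequential pattern mining uses sequential frequent pattern mining".split()
pat = ["sequential", "pattern", "mining"]
print(len(one_off_occurrences(doc, pat, min_gap=0, max_gap=1)))
print(len(one_off_occurrences(doc, pat, min_gap=0, max_gap=0)))
```

With up to one wildcard allowed, the second mention ("sequential frequent pattern mining") also counts toward the support of the pattern; with no wildcards allowed, only the contiguous mention does. This is the kind of flexibility the abstract attributes to gap constraints.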
Pages: 27-39
Page count: 13