Inside Importance Factors of Graph-Based Keyword Extraction on Chinese Short Text

Cited by: 4
Authors
Chen, Junjie [1, 2]
Hou, Hongxu [1 ]
Gao, Jing [2 ]
Affiliations
[1] Inner Mongolia Univ, Coll Comp Sci, 235 West Univ Rd, Hohhot 010021, Inner Mongolia, Peoples R China
[2] Inner Mongolia Agr Univ, Coll Comp Sci & Informat Engn, 306 Zhao Wuda Rd, Hohhot 010018, Inner Mongolia, Peoples R China
Keywords
Short text; keyword extraction; importance rank; keyphrase extraction
DOI
10.1145/3388971
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Keywords are considered to be the important words in a text and can provide a concise representation of it. With the surge of unlabeled short text on the Internet, automatic keyword extraction has proven useful in other information processing applications. Graph-based approaches are the prevalent unsupervised models for this task. However, most of these methods emphasize the relations between words without considering other importance factors. Furthermore, when measuring the importance of a word in a text, they set the damping factor to 0.85, following PageRank. To the best of our knowledge, no existing work investigates the impact of the damping factor on the keyword extraction task. In addition, few labeled Chinese short-text datasets are publicly available for this task. In this article, we investigate the factors that make up a word's importance in a given document and propose an improved graph-based method for keyword extraction from short documents. Moreover, we analyze the impact of these importance factors on performance. We also provide annotated long and short Chinese datasets for this task. The model is evaluated on Chinese and English datasets, and the results show that it improves on previous unsupervised models on short documents. Comparative experiments show that the damping factor is related to text length, a relationship that traditional methods neglect.
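To make the role of the damping factor concrete, the following is a minimal sketch of a generic PageRank-style word ranking over a co-occurrence graph. It is not the method proposed in the article: the function names, the `window` parameter, the unweighted edges, and the fixed iteration count are all assumptions chosen for illustration. Only the default damping value of 0.85 comes from the abstract.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Link each word to the next `window` words (undirected, unweighted).

    Hypothetical helper: the windowing scheme is an assumption, not the
    article's graph construction.
    """
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def rank_words(graph, damping=0.85, iterations=50):
    """Score words with the standard PageRank update on an undirected graph.

    Each word keeps a base score of (1 - damping) / n and receives the rest
    from its neighbours, each of which splits its score evenly over its edges.
    """
    n = len(graph)
    scores = {word: 1.0 / n for word in graph}
    for _ in range(iterations):
        scores = {
            word: (1.0 - damping) / n
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[word])
            for word in graph
        }
    return scores


if __name__ == "__main__":
    tokens = ["graph", "keyword", "extraction", "short", "text", "keyword", "graph"]
    graph = build_cooccurrence_graph(tokens)
    # Compare the ranking under two damping values.
    for d in (0.85, 0.5):
        ranked = sorted(rank_words(graph, damping=d).items(), key=lambda kv: -kv[1])
        print(d, ranked)
```

Exposing `damping` as a parameter, rather than hard-coding 0.85, is what allows the kind of comparison across text lengths that the abstract describes. With `damping=0.0` every word receives the same score, so the graph structure has no influence at all.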
Pages: 15