Inside Importance Factors of Graph-Based Keyword Extraction on Chinese Short Text

Cited by: 4

Authors:

Chen, Junjie [1,2]

Hou, Hongxu [1]

Gao, Jing [2]

Affiliations:

[1] Inner Mongolia Univ, Coll Comp Sci, 235 West Univ Rd, Hohhot 010021, Inner Mongolia, Peoples R China

[2] Inner Mongolia Agr Univ, Coll Comp Sci & Informat Engn, 306 Zhao Wuda Rd, Hohhot 010018, Inner Mongolia, Peoples R China

Source:

ACM Transactions on Asian and Low-Resource Language Information Processing | 2020, Vol. 19, Issue 5

Keywords:

Short text; keyword extraction; importance rank; keyphrase extraction

DOI:

10.1145/3388971

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Keywords are the important words in a text and provide a concise representation of it. With the surge of unlabeled short text on the Internet, automatic keyword extraction has proven useful in other information processing applications. Graph-based approaches are the prevalent unsupervised models for this task. However, most of these methods emphasize the relations between words without considering other importance factors. Furthermore, when measuring the importance of a word in a text, they set the damping factor to 0.85, following PageRank. To the best of our knowledge, no existing work investigates the impact of the damping factor on the keyword extraction task. In addition, there are few publicly available labeled Chinese short text datasets for this task. In this article, we investigate the factors that contribute to the importance of words in a given document and propose an improved graph-based method for keyword extraction from short documents. Moreover, we analyze the impact of these importance factors on performance. We also provide annotated long and short Chinese datasets for this task. The model is evaluated on Chinese and English datasets, and the results show that it improves on previous unsupervised models on short documents. Comparative experiments show that the damping factor is related to text length, which traditional methods neglect.
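To make the role of the damping factor concrete, the following is a minimal sketch of generic graph-based word ranking (plain PageRank over a word co-occurrence graph). It is not the method proposed in the article. The function names, the window size, and the sample tokens are hypothetical choices made for illustration, and the input is assumed to be already tokenized.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Link every pair of distinct words that occur within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                graph[word].add(tokens[j])
                graph[tokens[j]].add(word)
    return graph


def rank_words(graph, damping=0.85, iterations=50):
    """Plain PageRank on an undirected, unweighted word graph.

    `damping` is the parameter the abstract refers to; 0.85 is the
    PageRank default, not a value tuned for short text.
    """
    nodes = list(graph)
    n = len(nodes)
    scores = {w: 1.0 / n for w in nodes}
    for _ in range(iterations):
        new_scores = {}
        for w in nodes:
            # Each neighbor v passes on an equal share of its current score.
            incoming = sum(scores[v] / len(graph[v]) for v in graph[w])
            new_scores[w] = (1.0 - damping) / n + damping * incoming
        scores = new_scores
    return scores


if __name__ == "__main__":
    # Hypothetical, already-tokenized short text.
    tokens = ["graph", "keyword", "extraction", "short", "text", "keyword", "graph"]
    graph = build_cooccurrence_graph(tokens, window=2)
    for d in (0.5, 0.85):
        ranked = sorted(rank_words(graph, damping=d).items(), key=lambda kv: -kv[1])
        print(d, ranked)
```

Running the ranking with several damping values on texts of different lengths is one simple way to probe the sensitivity that the abstract describes. With `damping=0` every word receives the same score, so the graph structure only matters as the damping factor grows.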

Pages: 15

References: 52 in total

