Utilizing Recurrent Neural Network for topic discovery in short text scenarios

Cited: 10
Authors
Lu, Heng-Yang [1 ]
Kang, Ning [1 ]
Li, Yun [1 ]
Zhan, Qian-Yi [2 ]
Xie, Jun-Yuan [1 ]
Wang, Chong-Jun [1 ]
Affiliations
[1] Nanjing Univ, Dept Comp Sci & Technol, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China
[2] Jiangnan Univ, Sch Digital Media, Wuxi, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Topic model; short text; Recurrent Neural Network; bigrams; MODEL;
DOI
10.3233/IDA-183842
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The volume of short text data, such as tweets and online Q&A pairs, has grown rapidly in recent years, making it essential to organize and summarize these data automatically. Topic models are one of the effective approaches, with applications in text mining, personalized recommendation, and other domains. Conventional models such as pLSA and LDA are designed for long documents, and they may suffer from the sparsity problem caused by the lack of words in short text scenarios. Recent studies such as BTM show that using word co-occurrence pairs is effective in relieving the sparsity problem. However, both BTM and its extensions ignore the quantifiable relationship between words. From our perspective, two more closely related words should be more likely to occur in the same topic. Based on this idea, we introduce a model named RIBS, which uses an RNN to learn word relationships. Building on the learned relationships, we further introduce RIBS-Bigrams, a model that can display topics with bigrams. In experiments on two open-source, real-world datasets, RIBS achieves better coherence in topic discovery, and RIBS-Bigrams achieves better readability in topic display. In the document characterization task, the document representations produced by RIBS lead to better purity and entropy in clustering and higher accuracy in classification.
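The word co-occurrence idea the abstract credits to BTM can be sketched as follows. This is a minimal illustration of extracting unordered co-occurrence pairs (biterms) from short documents, not the authors' RIBS implementation; the function name and toy corpus are invented for the example:

```python
from itertools import combinations

def extract_biterms(doc_tokens):
    """Extract unordered word co-occurrence pairs (biterms) from one short document.

    In biterm-style models, every unordered pair of distinct words in a short
    text counts as one co-occurrence observation, which relieves the sparsity
    of individual short documents.
    """
    return [tuple(sorted(p)) for p in combinations(doc_tokens, 2) if p[0] != p[1]]

# Toy corpus (invented for illustration): each document is a tokenized short text.
docs = [["topic", "model", "short", "text"],
        ["rnn", "topic", "model"]]
biterms = [b for d in docs for b in extract_biterms(d)]
```

Models in this family then estimate topics over the pooled biterm set rather than over individual documents; RIBS additionally weights such pairs by a relationship learned with an RNN.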
Pages: 259-277
Page count: 19