TOPIC MODEL AND SIMILARITY CALCULATION OF TEXT ON SPARK

Cited by: 0
Authors
Dai, Changsong [1]
Wang, Yongbin [2]
Wang, Qi [3]
Affiliations
[1] Commun Univ China, Internet Informat Res Inst, Beijing 100024, Peoples R China
[2] Commun Univ China, Collaborat Innovat Ctr, Beijing, Peoples R China
[3] Commun Univ China, Dept Technol, Beijing, Peoples R China
Source
2017 14TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP) | 2017
Keywords
LDA; Spark; Text similarity; Topic model; Data mining;
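DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract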
Since the start of the 21st century, with the spread and development of the Internet, the volume of text data online has grown rapidly. To extract the topics people need from text content, and to address problems such as the excessive dimensionality of vectors in the model and the dependence on word frequency, David M. Blei proposed an efficient topic-model-based method known as Latent Dirichlet Allocation (LDA). The LDA model uses Gibbs sampling to estimate the parameters of the document-topic and topic-word distributions, and represents text content as a probability distribution over topics, which can then be used to calculate document similarity. In this paper, Spark serves as the computing platform for running the LDA algorithm, exploiting Spark's parallel computing advantages to process a large text corpus, and a document similarity calculation method based on LDA is proposed.
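As an illustration of the idea that documents can be compared through their topic distributions, the following is a minimal sketch in plain Python. The abstract does not state which similarity measure the paper uses, so both measures shown here (cosine similarity and a similarity derived from the Jensen-Shannon divergence) are assumptions chosen because they are common for comparing probability vectors. The function names and the three example document-topic vectors are hypothetical; in practice the vectors would come from a trained LDA model.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two document-topic vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), which lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_similarity(p, q):
    """Assumed convention: similarity = 1 - JS divergence."""
    return 1.0 - js_divergence(p, q)


# Hypothetical document-topic distributions over three topics.
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.65, 0.25, 0.10]
doc_c = [0.05, 0.15, 0.80]

if __name__ == "__main__":
    print("cosine(a, b):", cosine_similarity(doc_a, doc_b))
    print("cosine(a, c):", cosine_similarity(doc_a, doc_c))
    print("js_sim(a, b):", js_similarity(doc_a, doc_b))
    print("js_sim(a, c):", js_similarity(doc_a, doc_c))
```

Under either measure, doc_a and doc_b (which share a dominant topic) score as more similar than doc_a and doc_c. Because the vectors have one entry per topic rather than one per vocabulary word, the comparison avoids the high dimensionality that the abstract mentions.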
Pages: 15-19
Page count: 5
Related papers
7 in total
[1]   Latent Dirichlet allocation [J].
Blei, DM ;
Ng, AY ;
Jordan, MI .
JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5) :993-1022
[2]  
Liang Jian, 2016, PERSONALIZED PUSH BA, P6
[3]   A local context-aware LDA model for topic modeling in a document network [J].
Liu, Yang ;
Xu, Songhua .
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2017, 68 (06) :1429-1448
[4]  
Tang LiZhe, 2017, J COMPUTER APPL, V38, P265
[5]  
Wang Hongxu, 2015, Journal of Frontiers of Computer Science and Technology, V9, P1066, DOI 10.3778/j.issn.1673-9418.1411045
[6]  
Xiao Jian, 2016, STUDY SPARK PARALLEL, P5
[7]  
Zhao Xinglei, 2001, J BEIJING U INFORM T, V32, P70