A Phrase Topic Model for Large-scale Corpus

Citations: 0
Authors
Li, Baoji [1]
Xu, Wenhua [1]
Tian, Yuhui [1]
Chen, Juan [1]
Affiliations
[1] Ocean Univ China, Coll Informat Sci & Engn, Qingdao, Shandong, Peoples R China
Source
2019 IEEE 4TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND BIG DATA ANALYSIS (ICCCBDA) | 2019
Keywords
large-scale corpus; Latent Dirichlet Allocation; phrase topic model; regular expression;
DOI
10.1109/icccbda.2019.8725681
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The topic model is an unsupervised learning model and one of the important tools for large-scale corpus analysis; it is widely used in information retrieval, natural language processing, and machine learning. Traditional topic models, such as Latent Dirichlet Allocation (LDA), ignore word order. However, in many text-mining tasks, word order and phrases are often crucial for capturing the meaning of texts efficiently. We propose a phrase topic model based on LDA that integrates a regular-expression constraint. Our model makes topics more meaningful and interpretable at the cost of only a limited increase in the size of the vocabulary. The experimental results show that our algorithm finds meaningful phrases and has generic applicability on our test data set.
Pages: 634-639
Page count: 6
Related papers
15 in total
[1] Arnold C, 2012, SIGIR 2012: PROCEEDINGS OF THE 35TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, P1031, DOI 10.1145/2348283.2348454
[2] Bhatia S, 2018, 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), P844
[3] Blei David M, 2009, ARXIV09071013
[4] Blei, DM; Ng, AY; Jordan, MI. Latent Dirichlet allocation [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5): 993-1022
[5] Chen, SF; Goodman, J. An empirical study of smoothing techniques for language modeling [J]. COMPUTER SPEECH AND LANGUAGE, 1999, 13 (04): 359-394
[6] Danilevsky Marina, 2014, P 2014 SIAM INT C DA, P398
[7] Frantzi K. T., 1998, Research and Advanced Technology for Digital Libraries. Second European Conference, ECDL'98. Proceedings, P585
[8] Griffiths, TL; Steyvers, M. Finding scientific topics [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2004, 101: 5228-5235
[9] He YL, 2016, AAAI CONF ARTIF INTE, P2957
[10] Hofmann Thomas, 2017, ACM SIGIR Forum, V51, P211, DOI 10.1145/3130348.3130370