Emerging topic detection in twitter stream based on high utility pattern mining

Cited by: 80
Authors
Choi, Hyeok-Jun [1]
Park, Cheong Hee [1]
Affiliations
[1] Chungnam Natl Univ, Dept Comp Sci & Engn, Daejeon, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Frequent pattern mining; High utility pattern mining; Topic detection; Twitter stream; FREQUENT;
DOI
10.1016/j.eswa.2018.07.051
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Among internet and smart device applications, Twitter has become a leading social media platform, disseminating events occurring around the world in real time. Many studies have been conducted to identify valuable information on Twitter. Recently, Frequent Pattern Mining has been applied to topic detection on Twitter. In Frequent Pattern Mining, a topic is considered to be a group of words that appear simultaneously; however, the method considers only the frequency of words, and their utility for topic detection is not taken into account during pattern generation. In this paper, we propose a method to detect emerging topics on Twitter based on High Utility Pattern Mining (HUPM), which takes frequency and utility into account at the same time. For each chunk of tweets obtained by time-based windowing on the Twitter stream, we define the utility of words based on the growth rate in frequency and find groups of words with high frequency and high utility by HUPM. For post-processing, to extract actual topic patterns from the candidate topic patterns generated by HUPM, an efficient data structure called Topic-tree (TP-Tree) is also proposed. Experimental results demonstrated the effectiveness of the proposed method, which showed superior performance and shorter running time than the other tested topic detection methods. In particular, the proposed method showed a 5% higher topic recall than the other compared methods on the three datasets used. © 2018 Elsevier Ltd. All rights reserved.
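The abstract describes the pipeline only at a high level, so the following Python sketch is an illustration under stated assumptions rather than the authors' algorithm. It assumes a smoothed ratio of document frequencies between two consecutive windows as the "growth rate" utility, and it replaces a real HUPM algorithm with brute-force enumeration of small word sets. The function names `word_utilities` and `high_utility_patterns`, the `smoothing` parameter, and the thresholds are all hypothetical; the TP-Tree post-processing step is not modelled.

```python
from collections import Counter
from itertools import combinations


def word_utilities(prev_window, curr_window, smoothing=1.0):
    """Assumed utility: smoothed growth rate of a word's document frequency
    from the previous window to the current one (not the paper's formula)."""
    prev_df = Counter(w for tweet in prev_window for w in set(tweet))
    curr_df = Counter(w for tweet in curr_window for w in set(tweet))
    return {
        w: (curr_df[w] + smoothing) / (prev_df[w] + smoothing)
        for w in curr_df
    }


def high_utility_patterns(curr_window, utilities, min_utility, max_size=2):
    """Brute-force stand-in for HUPM: a pattern's utility is the sum, over
    every tweet containing all of its words, of those words' utilities."""
    totals = Counter()
    for tweet in curr_window:
        words = sorted(set(tweet))
        for size in range(1, max_size + 1):
            for pattern in combinations(words, size):
                totals[pattern] += sum(utilities[w] for w in pattern)
    return {p: u for p, u in totals.items() if u >= min_utility}


# Two consecutive time windows of tokenized tweets (made-up toy data).
prev_window = [["coffee", "morning"], ["coffee", "rain"]]
curr_window = [
    ["earthquake", "tokyo"],
    ["earthquake", "tokyo", "coffee"],
    ["coffee", "morning"],
]

utilities = word_utilities(prev_window, curr_window)
patterns = high_utility_patterns(curr_window, utilities, min_utility=10)
print(patterns)
```

In this toy data, "coffee" is frequent in both windows, so its growth rate stays near 1 and it contributes little utility, while the newly appearing pair ("earthquake", "tokyo") accumulates enough utility to pass the threshold. A practical implementation would use a pruning-based HUPM algorithm instead of enumerating every word combination.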
Pages: 27-36
Number of pages: 10