Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis

Cited by: 167
Authors
Albalawi, Rania [1]
Yeap, Tet Hin [1, 2]
Benyoucef, Morad
Affiliations
[1] Univ Ottawa, Sch Informat Technol & Engn, Ottawa, ON, Canada
[2] Univ Ottawa, Telfer Sch Management, Ottawa, ON, Canada
Source
FRONTIERS IN ARTIFICIAL INTELLIGENCE | 2020 / Vol. 3
Keywords
natural language processing; topic modeling; short text; user-generated content; online social networks
DOI
10.3389/frai.2020.00042
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
With the growth of online social network platforms and applications, large amounts of textual user-generated content are created daily in the form of comments, reviews, and short-text messages. As a result, users often find it challenging to discover useful information or learn more about the topic being discussed from such content. Machine learning and natural language processing algorithms are used to analyze the massive amount of textual social media data available online, including topic modeling techniques, which have gained popularity in recent years. This paper surveys topic modeling and its common application areas, methods, and tools. We also examine and compare five frequently used topic modeling methods, as applied to short textual social data, to show their practical benefits in detecting important topics. These methods are latent semantic analysis, latent Dirichlet allocation, non-negative matrix factorization, random projection, and principal component analysis. Two textual datasets were selected to evaluate the performance of these methods based on topic quality and standard statistical evaluation metrics such as recall, precision, F-score, and topic coherence. In this evaluation, latent Dirichlet allocation and non-negative matrix factorization delivered more meaningful extracted topics and obtained good results. The paper sheds light on some common topic modeling methods in a short-text context and provides direction for researchers who seek to apply these methods.
Pages: 14