DATM: A Novel Data Agnostic Topic Modeling Technique With Improved Effectiveness for Both Short and Long Text

Cited by: 3
Authors
Bewong, Michael [1]
Wondoh, John [2]
Kwashie, Selasi [3]
Liu, Jixue [2]
Liu, Lin [2]
Li, Jiuyong [2]
Islam, Md. Zahidul [1]
Kernot, David [4]
Affiliations
[1] Charles Sturt Univ, Sch Comp Math & Engn, Wagga Wagga, NSW 2650, Australia
[2] Univ South Australia, Sch Informat Technol & Math Sci, Adelaide, SA 5095, Australia
[3] Charles Sturt Univ, Artificial Intelligence & Cyber Futures Inst, Bathurst, NSW 2795, Australia
[4] Dept Def, Def Sci Technol Grp, Edinburgh, SA 5111, Australia
Keywords
Data models; Reliability; Australia; Task analysis; Social networking (online); Context modeling; Benchmark testing; Document handling; Document transformation; Greedy algorithm; Information retrieval; Latent Dirichlet allocation; Multi-set multi-cover problem; Probabilistic generative topic modelling; Approximation algorithms
DOI
10.1109/ACCESS.2023.3262653
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Topic modelling is important for tackling several data mining tasks in information retrieval. While seminal topic modelling techniques such as Latent Dirichlet Allocation (LDA) have been proposed, the ubiquity of social media and the brevity of its texts pose unique challenges for such traditional techniques. Several extensions, including auxiliary aggregation, self-aggregation and direct learning, have been proposed to mitigate these challenges; however, some still remain. These include a lack of consistency in the topics generated and a decline in model performance in applications involving disparate document lengths. There is a recent paradigm shift towards neural topic models, which are not suited to resource-constrained environments. This paper revisits LDA-style techniques, taking a theoretical approach to analyse the relationship between word co-occurrence and topic models. Our analysis shows that altering the word co-occurrences within the corpus can enhance topic discovery. Thus, we propose a novel data transformation approach, dubbed DATM, to improve topic discovery within a corpus. A rigorous empirical evaluation shows that DATM is not only powerful, but can also be used in conjunction with existing benchmark techniques to significantly improve their effectiveness and consistency, by up to 2-fold.
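As a rough illustration of the word co-occurrence statistic the abstract refers to, the sketch below counts how many documents contain each pair of words. It is not the DATM transformation, which this record does not describe; the function name `cooccurrence_counts` and the toy corpus are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each unordered word pair, the documents containing both words."""
    counts = Counter()
    for doc in docs:
        # Use the set of distinct words so each pair counts once per document;
        # sorting gives every pair a canonical (alphabetical) order.
        vocab = sorted(set(doc.lower().split()))
        counts.update(combinations(vocab, 2))
    return counts


# Hypothetical toy corpus: two short texts and one longer text.
corpus = [
    "topic model",
    "short text",
    "topic model for short text and long text",
]

pairs = cooccurrence_counts(corpus)
print(pairs[("model", "topic")])  # pair seen in the first and third documents
print(pairs[("short", "topic")])  # pair seen only in the longer document
```

In this toy corpus the short texts contribute very few pairs each, while the longer text contributes many, which hints at why corpora of mixed document lengths yield uneven co-occurrence evidence.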
Pages: 32826-32841
Number of pages: 16