Analysis and tuning of hierarchical topic models based on Renyi entropy approach

Cited by: 0
Authors
Koltcov S. [1]
Ignatenko V. [1]
Terpilovskii M. [1]
Rosso P. [1,2]
Affiliations
[1] Laboratory for Social and Cognitive Informatics, National Research University Higher School of Economics, St. Petersburg
[2] Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, Valencia
Keywords
Data Mining and Machine Learning; Data Science; Hierarchical topic models; Natural Language and Speech; Optimal number of topics; Renyi entropy; Topic modeling
DOI
10.7717/PEERJ-CS.608
Abstract
Hierarchical topic modeling is a potentially powerful instrument for determining the topical structure of text collections that additionally allows constructing a hierarchy representing levels of topic abstractness. However, parameter optimization in hierarchical models, which includes finding an appropriate number of topics at each level of the hierarchy, remains a challenging task. In this paper, we propose an approach based on Renyi entropy as a partial solution to this problem. First, we introduce a Renyi entropy-based quality metric for hierarchical models. Second, we propose a practical approach to obtaining the “correct” number of topics in hierarchical topic models and show how model hyperparameters should be tuned for that purpose. We test this approach on datasets with a known number of topics, as determined by human mark-up: three in English and one in Russian. In the numerical experiments, we consider three hierarchical models: hierarchical latent Dirichlet allocation (hLDA), hierarchical Pachinko allocation (hPAM), and hierarchical additive regularization of topic models (hARTM). We demonstrate that the hLDA model exhibits a significant level of instability and, moreover, that the numbers of topics it derives are far from the true numbers for the labeled datasets. For the hPAM model, the Renyi entropy approach allows determining only one level of the data structure. For the hARTM model, the proposed approach allows us to estimate the number of topics for two levels of the hierarchy. © 2021 Koltcov et al. All Rights Reserved.
Pages: 1–35
Number of pages: 34