Assessment of the Quality of Topic Models for Information Retrieval Applications

Cited by: 1
Authors
Yuan, Meng [1]
Lin, Pauline [1]
Rashidi, Lida [1]
Zobel, Justin [1]
Affiliations
[1] Univ Melbourne, Parkville, Vic, Australia
Source
PROCEEDINGS OF THE 2023 ACM SIGIR INTERNATIONAL CONFERENCE ON THE THEORY OF INFORMATION RETRIEVAL, ICTIR 2023 | 2023
Keywords
topic modelling; topic coherence; collection representation; PHRASE;
DOI
10.1145/3578337.3605118
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Topic modelling is an approach to describing a document collection as a set of topics, where each topic has a distinct theme and each document is a blend of topics. It has been applied to retrieval in a range of ways, but there has been little prior work on measuring whether the topics are descriptive in this context. Moreover, existing methods for assessing topic quality do not consider how well individual documents are described. To address this issue we propose a new measure of topic quality, which we call specificity; the basis of this measure is the extent to which individual documents are described by a limited number of topics. We also propose a new experimental protocol for validating topic-quality measures, a 'noise dial' that quantifies the extent to which a measure's scores are altered as the topics are degraded by the addition of noise. The principle of the mechanism is that a meaningful measure should produce low scores if the 'topics' are essentially random. We show that specificity is at least as effective as existing measures of topic quality and does not require external resources. While other measures relate only to topics, not to documents, we further show that specificity correlates with the extent to which topic models are informative in the retrieval process.
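The abstract does not give the formula for specificity or the exact noise procedure, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a hypothetical stand-in score (one minus the normalized entropy of each document's topic mixture, averaged over documents) and a hypothetical noise step that blends each document's mixture with a random distribution. The names `concentration_score` and `add_noise`, the toy `doc_topic` matrix, and the choice to perturb document-topic mixtures rather than the topics themselves are all assumptions made for this example.

```python
import math
import random


def normalized_entropy(dist):
    """Shannon entropy of a probability vector, scaled to [0, 1]."""
    k = len(dist)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in dist if p > 0)
    return h / math.log(k)


def concentration_score(doc_topic):
    """Hypothetical stand-in for a specificity-style score (an assumption,
    not the paper's definition): mean of 1 - normalized entropy, so documents
    dominated by a few topics score high and flat mixtures score near 0."""
    return sum(1.0 - normalized_entropy(d) for d in doc_topic) / len(doc_topic)


def add_noise(doc_topic, level, rng):
    """Hypothetical noise step: blend each mixture with a random distribution.
    level = 0 leaves the mixture unchanged; level = 1 replaces it entirely."""
    noisy = []
    for d in doc_topic:
        r = [rng.random() for _ in d]
        total = sum(r)
        r = [x / total for x in r]
        noisy.append([(1 - level) * p + level * q for p, q in zip(d, r)])
    return noisy


# Toy document-topic mixtures (rows sum to 1); purely illustrative data.
doc_topic = [
    [0.90, 0.05, 0.03, 0.02],
    [0.05, 0.85, 0.05, 0.05],
    [0.10, 0.10, 0.75, 0.05],
]

rng = random.Random(0)
scores = {
    level: concentration_score(add_noise(doc_topic, level, rng))
    for level in (0.0, 0.5, 1.0)
}
for level, score in scores.items():
    print(f"noise level {level:.1f}: score {score:.3f}")
```

Under these assumptions, a measure that behaves as the abstract describes should score the undegraded mixtures well above the fully random ones; sweeping `level` from 0 to 1 is the sketch's analogue of turning the dial.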
Pages: 265-274
Number of pages: 10