Assessment of the Quality of Topic Models for Information Retrieval Applications

Cited by: 1

Authors:

Yuan, Meng [1]

Lin, Pauline [1]

Rashidi, Lida [1]

Zobel, Justin [1]

Affiliations:

[1] Univ Melbourne, Parkville, Vic, Australia

Source:

PROCEEDINGS OF THE 2023 ACM SIGIR INTERNATIONAL CONFERENCE ON THE THEORY OF INFORMATION RETRIEVAL, ICTIR 2023 | 2023

Keywords:

topic modelling; topic coherence; collection representation; PHRASE;

DOI:

10.1145/3578337.3605118

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Topic modelling is an approach to generation of descriptions of document collections as a set of topics where each has a distinct theme and documents are a blend of topics. It has been applied to retrieval in a range of ways, but there has been little prior work on measurement of whether the topics are descriptive in this context. Moreover, existing methods for assessment of topic quality do not consider how well individual documents are described. To address this issue we propose a new measure of topic quality, which we call specificity; the basis of this measure is the extent to which individual documents are described by a limited number of topics. We also propose a new experimental protocol for validating topic-quality measures, a 'noise dial' that quantifies the extent to which the measure's scores are altered as the topics are degraded by addition of noise. The principle of the mechanism is that a meaningful measure should produce low scores if the 'topics' are essentially random. We show that specificity is at least as effective as existing measures of topic quality and does not require external resources. While other measures relate only to topics, not to documents, we further show that specificity correlates to the extent to which topic models are informative in the retrieval process.


Pages: 265-274

Page count: 10

