Validation of scientific topic models using graph analysis and corpus metadata

Cited by: 0
Authors
Manuel A. Vázquez
Jorge Pereira-Delgado
Jesús Cid-Sueiro
Jerónimo Arenas-García
Affiliations
[1] Universidad Carlos III de Madrid,
Source
Scientometrics | 2022 / Vol. 127
Keywords
Topic modeling; Latent Dirichlet Allocation; Graph analysis; Semantic similarity; Model validation;
DOI
Not available
Abstract
Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology and innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which focus on the quality of topics, are a poor fit for applications where the quality of the document similarity relations inferred from the topic model is especially relevant. Relying on graph analysis techniques, our work aims to establish a new methodology for hyperparameter selection that is specifically oriented to optimizing the similarity metrics derived from the topic model. To this end, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second measures the alignment between the graph derived from the LDA model and another graph obtained using metadata available for the corresponding corpus. Experiments on various STI-related corpora show that the proposed metrics provide relevant indicators for selecting the number of topics and building persistent topic models that are consistent with the metadata. Their use, which can be extended to other topic models beyond LDA, could facilitate the systematic adoption of these techniques in STI policy analysis and design.
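The abstract does not give the exact definitions of the two graph metrics, so the following Python sketch only illustrates the general idea under stated assumptions. It assumes that document similarity is the Bhattacharyya coefficient between document-topic vectors, that variability is the mean edge-wise standard deviation across runs, and that alignment is the Pearson correlation between edge weights. All function names are hypothetical and none of these choices should be read as the authors' method.

```python
import numpy as np


def similarity_graph(theta):
    """Weighted adjacency matrix from a document-topic matrix.

    theta: array of shape (n_docs, n_topics), each row summing to 1.
    ASSUMPTION: similarity is the Bhattacharyya coefficient between rows.
    """
    root = np.sqrt(np.asarray(theta, dtype=float))
    sim = root @ root.T          # values in [0, 1] for probability rows
    np.fill_diagonal(sim, 0.0)   # drop self-loops
    return sim


def edge_weights(graph):
    """Upper-triangular edge weights of a symmetric adjacency matrix."""
    i, j = np.triu_indices(graph.shape[0], k=1)
    return graph[i, j]


def run_variability(graphs):
    """ASSUMPTION: mean over edges of the std of each edge weight across runs."""
    edges = np.stack([edge_weights(g) for g in graphs])
    return float(edges.std(axis=0).mean())


def metadata_alignment(model_graph, metadata_graph):
    """ASSUMPTION: Pearson correlation between the two sets of edge weights."""
    a = edge_weights(model_graph)
    b = edge_weights(metadata_graph)
    return float(np.corrcoef(a, b)[0, 1])


# Toy usage: random document-topic matrices stand in for repeated LDA runs.
rng = np.random.default_rng(0)
runs = [similarity_graph(rng.dirichlet(np.ones(5), size=20)) for _ in range(3)]
metadata = similarity_graph(rng.dirichlet(np.ones(4), size=20))
print("variability:", run_variability(runs))
print("alignment:", metadata_alignment(runs[0], metadata))
```

Under these assumptions, one would compute both scores for each candidate number of topics and prefer settings with low variability across runs and high alignment with the metadata graph.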
Pages: 5441-5458
Page count: 17