Benchmarking topic models on scientific articles using BERTeley

Cited by: 6
Authors
Chagnon, Eric [1 ]
Pandolfi, Ronald [1 ]
Donatelli, Jeffrey [1 ]
Ushizima, Daniela [1 ]
Affiliations
[1] Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA 94720, United States
Source
Natural Language Processing Journal | 2024, Vol. 6
Keywords
Computational linguistics; Natural language processing systems; Statistics
DOI
10.1016/j.nlp.2023.100044
Abstract
The introduction of BERTopic marked a crucial advancement in topic modeling, presenting a model that outperformed both traditional and modern topic models on standard topic modeling metrics across a variety of corpora. However, unique issues arise when topic modeling is performed on scientific articles. This paper introduces BERTeley, an innovative tool built upon BERTopic, designed to alleviate these shortcomings and improve BERTopic's usability when modeling a corpus of scientific articles. It does so through BERTeley's three main features: scientific article preprocessing, topic modeling using pre-trained scientific language models, and topic model metric calculation. Furthermore, an experiment was conducted comparing topic models built with four different language models on three corpora of scientific articles. © 2023 The Author(s)
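The abstract lists scientific article preprocessing as one of BERTeley's three features. As a rough illustration of why such a step matters, the sketch below (a hypothetical cleaning function, not BERTeley's actual pipeline) strips artifacts common to scientific text, such as inline LaTeX math, numeric citation markers, and copyright notices, which would otherwise pollute the topic vocabulary:

```python
import re

def preprocess_scientific_text(text: str) -> str:
    """Illustrative cleaning for scientific abstracts before topic modeling.
    Hypothetical sketch only; BERTeley's real preprocessing may differ."""
    text = re.sub(r"\$[^$]*\$", " ", text)            # drop inline LaTeX math, e.g. $C_v$
    text = re.sub(r"\[\d+(?:,\s*\d+)*\]", " ", text)  # drop numeric citations, e.g. [1, 2]
    text = re.sub(r"©.*$", " ", text)                 # drop trailing copyright notice
    return re.sub(r"\s+", " ", text).strip()          # collapse leftover whitespace

cleaned = preprocess_scientific_text(
    "Topic coherence $C_v$ improved [1, 2]. © 2023 The Author(s)"
)
print(cleaned)
```

A cleaned string like this would then be passed to the embedding model; in a BERTopic-style workflow the embedding model is swappable, which is how a pre-trained scientific language model can be substituted for the general-purpose default.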