Benchmarking topic models on scientific articles using BERTeley

Cited by: 6
Authors
Chagnon, Eric [1 ]
Pandolfi, Ronald [1 ]
Donatelli, Jeffrey [1 ]
Ushizima, Daniela [1 ]
Affiliations
[1] Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA 94720, United States
Source
Natural Language Processing Journal | 2024, Vol. 6
Keywords
Computational linguistics; Natural language processing systems; Statistics
DOI
10.1016/j.nlp.2023.100044
Abstract
The introduction of BERTopic marked a crucial advancement in topic modeling, presenting a model that outperformed both traditional and modern topic models on standard topic modeling metrics across a variety of corpora. However, unique issues arise when topic modeling is performed on scientific articles. This paper introduces BERTeley, an innovative tool built upon BERTopic, designed to alleviate these shortcomings and improve BERTopic's usability when modeling a corpus of scientific articles. It does so through BERTeley's three main features: scientific article preprocessing, topic modeling using pre-trained scientific language models, and topic model metric calculation. Furthermore, an experiment was conducted comparing topic models built with four different language models on three corpora of scientific articles. © 2023 The Author(s)
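The abstract lists scientific article preprocessing as one of BERTeley's three features. As a rough illustration of why such a step matters, the sketch below (a hypothetical cleaning function, not BERTeley's actual pipeline) strips artifacts common to scientific text, such as inline LaTeX math, numeric citation markers, and copyright notices, which would otherwise pollute the topic vocabulary:

```python
import re

def preprocess_scientific_text(text: str) -> str:
    """Illustrative cleaning for scientific abstracts before topic modeling.
    Hypothetical sketch only; BERTeley's real preprocessing may differ."""
    text = re.sub(r"\$[^$]*\$", " ", text)            # drop inline LaTeX math, e.g. $C_v$
    text = re.sub(r"\[\d+(?:,\s*\d+)*\]", " ", text)  # drop numeric citations, e.g. [1, 2]
    text = re.sub(r"©.*$", " ", text)                 # drop trailing copyright notice
    return re.sub(r"\s+", " ", text).strip()          # collapse leftover whitespace

cleaned = preprocess_scientific_text(
    "Topic coherence $C_v$ improved [1, 2]. © 2023 The Author(s)"
)
print(cleaned)
```

A cleaned string like this would then be passed to the embedding model; in a BERTopic-style workflow the embedding model is swappable, which is how a pre-trained scientific language model can be substituted for the general-purpose default.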