Leveraging Multiple Representations of Topic Models for Knowledge Discovery

Cited by: 0
Authors
Potts, Colin M. [1 ]
Savaliya, Akshat [2 ,3 ]
Jhala, Arnav [1 ]
Affiliations
[1] North Carolina State Univ, Dept Comp Sci, Raleigh, NC 27695 USA
[2] Northeastern Univ, Khoury Coll Comp Sci, Boston, MA 02115 USA
[3] Amazon Web Serv AWS, Dallas, TX 75240 USA
Keywords
Artificial intelligence; Data analysis; Analytical models; Knowledge discovery; Computational modeling; Clustering algorithms; Semantics; Natural language processing; big data applications; data analysis; data visualization; knowledge discovery; natural language processing
DOI
10.1109/ACCESS.2022.3210529
CLC Classification Number
TP [Automation Technology; Computer Technology]
Discipline Code
0812
Abstract
Topic models are often useful for categorizing related documents in information retrieval and knowledge discovery systems, especially for large datasets. Interpreting the output of these models remains an ongoing challenge for the research community. The typical practice in applying topic models is to tune the parameters of a chosen model for a target dataset and select the model with the best output according to a given metric. We present a novel perspective on topic analysis: a process for combining the output of multiple models with different theoretical underpinnings. We show that this combination enables novel tasks, such as semantic characterization of content, that cannot be carried out with single models. One example task is characterizing the differences between topics or documents in terms of their purpose and their importance with respect to the underlying output of the discovery algorithm. To demonstrate the potential benefit of leveraging multiple models, we present an algorithm that maps the term space of Latent Dirichlet Allocation (LDA) to the neural document-embedding space of doc2vec. We also show that by running both models in parallel and analyzing the resulting document distributions with the Normalized Pointwise Mutual Information (NPMI) metric, we can gain insight into the purpose and importance of topics across models. This approach moves beyond topic identification to a richer characterization of the information and provides a better understanding of the complex relationships between these typically competing techniques.
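The abstract refers to the Normalized Pointwise Mutual Information (NPMI) metric for analyzing distributions across models. As a minimal illustrative sketch (not the paper's actual implementation), NPMI for a pair of terms can be computed from document co-occurrence counts; it normalizes PMI into the range [-1, 1], where 1 means the terms always co-occur, 0 means independence, and -1 means they never co-occur:

```python
import math

def npmi(count_xy: int, count_x: int, count_y: int, n: int) -> float:
    """Normalized Pointwise Mutual Information for a term pair.

    count_xy -- number of documents containing both terms
    count_x  -- number of documents containing the first term
    count_y  -- number of documents containing the second term
    n        -- total number of documents
    """
    if count_xy == 0:
        return -1.0  # the terms never co-occur
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    pmi = math.log(p_xy / (p_x * p_y))
    # Dividing by -log p(x, y) bounds the score to [-1, 1].
    return pmi / (-math.log(p_xy))
```

For example, two terms that each appear in half the documents and always together score 1.0, while statistically independent terms score 0.0.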
Pages: 104696-104705
Page count: 10
Cited References
(34 records)
[1]   What is wrong with topic modeling? And how to fix it using search-based software engineering [J].
Agrawal, Amritanshu ;
Fu, Wei ;
Menzies, Tim .
INFORMATION AND SOFTWARE TECHNOLOGY, 2018, 98 :74-88
[2]  
Ankerst M., 1999, SIGMOD Record, V28, N2, P49
[3]   Measuring Similarity Similarly: LDA and Human Perception [J].
Ben Towne, W. ;
Rosé, Carolyn P. ;
Herbsleb, James D. .
ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2016, 8 (01)
[4]   A neural probabilistic language model [J].
Bengio, Y ;
Ducharme, R ;
Vincent, P ;
Jauvin, C .
JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (06) :1137-1155
[5]  
Bhatia S., 2016, Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016), P953
[6]  
Blei D. M., 2001, SIGIR Forum, P343
[7]   Latent Dirichlet allocation [J].
Blei, DM ;
Ng, AY ;
Jordan, MI .
JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5) :993-1022
[8]  
Bojanowski P., 2017, Trans. Assoc. Comput. Linguistics, V5, P135, DOI 10.1162/tacl_a_00051
[9]   Topic-Noise Models: Modeling Topic and Noise Distributions in Social Media Post Collections [J].
Churchill, Rob ;
Singh, Lisa .
2021 21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2021), 2021, :71-80
[10]  
Du L., 2015, P 24 INT JOINT C ART, P1, DOI 10.1109/CSTIC.2015.7153433