Integrating Text Classification into Topic Discovery Using Semantic Embedding Models

Times cited: 0
Authors
Lezama-Sanchez, Ana Laura [1]
Vidal, Mireya Tovar [1]
Reyes-Ortiz, Jose A. [2]
Affiliations
[1] Benemerita Univ Autonoma Puebla, Fac Comp Sci, Puebla 72570, Mexico
[2] Univ Autonoma Metropolitana, Dept Sistemas, Mexico City 02200, Mexico
Source
APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 17
Keywords
deep learning; natural language processing; topic discovery; text classification
DOI
10.3390/app13179857
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
Topic discovery identifies the main ideas within large volumes of textual data. It surfaces recurring topics in documents, providing an overview of the text. Current topic discovery models receive text with or without pre-processing, which includes stop word removal, text cleaning, and normalization (lowercase conversion). A topic discovery process that receives general-domain text, with or without pre-processing, generates general topics. General topics do not offer a detailed overview of the input text, and manual text categorization is tedious and time-consuming. Extracting topics from automatically classified text is therefore necessary to generate specific topics enriched with top words that maintain semantic relationships among them. This paper presents an approach that integrates text classification into topic discovery from large amounts of English textual data, such as the 20-Newsgroups and Reuters corpora. We integrate automatic text classification before the topic discovery process to obtain specific topics for each class, with relevant semantic relationships between top words. Text classification analyzes the words that make up a document to decide which class or category it belongs to; the proposed integration then provides latent and specific topics, depicted by top words with high coherence, from each obtained class. Text classification is performed with a convolutional neural network (CNN) incorporating an embedding model based on semantic relationships. Topic discovery over categorized text is carried out with the latent Dirichlet allocation (LDA), probabilistic latent semantic analysis (PLSA), and latent semantic analysis (LSA) algorithms. Topic discovery over categorized text was evaluated with the normalized topic coherence metric. The 20-Newsgroups corpus was classified, and twenty topics with ten top words each were identified for each class. The normalized topic coherence obtained was 0.1723 with LDA, 0.1622 with LSA, and 0.1716 with PLSA. The Reuters Corpus was also classified, and twenty and fifty topics were identified. With LDA and 20 topics for each class, the normalized topic coherence was 0.1441; with LSA it was 0.1360, and with PLSA it was 0.1436.
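The abstract does not give the formula behind its normalized topic coherence metric. As a minimal sketch, assuming the metric is the commonly used normalized pointwise mutual information (NPMI) averaged over pairs of top words, the per-class evaluation could look like the following. The function name `npmi_coherence`, the toy documents, and the top-word lists are hypothetical illustrations, not the authors' data or implementation.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of top words.

    Probabilities are estimated from document-level co-occurrence
    (an assumption; the abstract does not specify the estimator).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: minimum NPMI
        elif p12 == 1:
            scores.append(1.0)   # the pair co-occurs everywhere: maximum NPMI
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical tokenized documents, already grouped by predicted class.
docs_by_class = {
    "sports": [["team", "game", "win"], ["team", "game", "score"], ["player", "season"]],
    "finance": [["bank", "rate", "market"], ["bank", "rate"], ["stock", "market"]],
}

# Hypothetical top words, standing in for the output of LDA, LSA, or PLSA per class.
top_words_by_class = {
    "sports": ["team", "game", "win"],
    "finance": ["bank", "rate", "market"],
}

for label, docs in docs_by_class.items():
    score = npmi_coherence(top_words_by_class[label], docs)
    print(f"{label}: {score:.4f}")
```

Scoring each class against its own documents mirrors the classify-then-discover order described in the abstract; a score near 1 means the top words tend to appear together, and a score near -1 means they rarely do.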
Pages: 15