A semi-supervised framework for concept-based hierarchical document clustering

Cited by: 0
Authors
Sadjadi, Seyed Mojtaba [1]
Mashayekhi, Hoda [1]
Hassanpour, Hamid [1]
Affiliation
[1] Shahrood Univ Technol, Fac Comp Engn, Shahrood, Iran
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2023 / Vol. 26 / Issue 06
Keywords
Semi-supervised clustering; Document clustering; Word embedding; Concept-based representation; Hierarchical clustering; ALGORITHM; WORDS;
DOI
10.1007/s11280-023-01209-4
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Text clustering is used in various applications of text analysis. In the clustering process, the employed document representation method has a significant impact on the results. Some popular document representation methods cannot effectively maintain the proximity information of the documents or suffer from low interpretability. Although the concept-based representation methods overcome these challenges to some extent, the existing semi-supervised document clustering methods rarely use this type of document representation. In this paper, we propose a concept-based semi-supervised framework for document clustering that uses both labeled and unlabeled data to yield a higher clustering quality. Concepts are composed of a set of semantically similar words. We propose the notion of semi-supervised concepts to benefit from document labels in extracting more relevant concepts. We also propose a new method of clustering documents based on the weights of such concepts. In the first and second steps of the proposed framework, the documents are represented based on the concepts extracted from the set of embedded words in the corpus. The proposed representation is interpretable and preserves the proximity information of documents. In the third step, the semi-supervised hierarchical clustering process utilizes unlabeled data to capture the overall structure of the clusters, and the supervision of a small number of labeled documents to adjust the cluster centroids. The use of concept vectors improves the process of merging clusters in the hierarchical clustering approach. The proposed framework is evaluated using the Reuters, 20-NewsGroups and WebKB text collections, and the results reveal the superiority of the proposed framework compared to several existing semi-supervised and unsupervised clustering approaches.
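The three steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the word embeddings, documents, and labels below are invented, the concepts come from a plain k-means stand-in that does not use labels (so it does not reproduce the paper's semi-supervised concepts), and the final step is a simple label-constrained centroid-linkage agglomeration chosen only to show how a few labels can steer merging. All function names are hypothetical.

```python
import numpy as np

# Toy 2-D word embeddings (invented for illustration only).
vocab = ["goal", "match", "team", "stock", "market", "bank"]
emb = np.array([[1.0, 0.1], [0.9, 0.2], [0.95, 0.0],
                [0.1, 1.0], [0.0, 0.9], [0.2, 0.95]])

def cluster_words(emb, k, iters=10):
    """Step 1 (stand-in): group embedded words into k concepts with k-means,
    using a deterministic farthest-first initialisation."""
    centers = [emb[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(emb - c, axis=1) for c in centers], axis=0)
        centers.append(emb[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        dists = np.linalg.norm(emb[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        centers = np.array([emb[assign == j].mean(axis=0)
                            if np.any(assign == j) else centers[j]
                            for j in range(k)])
    return assign

def concept_weights(tokens, word_to_concept, n_concepts):
    """Step 2: represent a document by the share of its words in each concept."""
    counts = np.zeros(n_concepts)
    for tok in tokens:
        if tok in word_to_concept:
            counts[word_to_concept[tok]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

def agglomerate(vectors, labels, n_clusters):
    """Step 3 (stand-in): merge the two clusters with the closest centroids,
    never merging clusters that carry different known labels."""
    clusters = [{"members": [i], "label": labels.get(i)}
                for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                la, lb = clusters[a]["label"], clusters[b]["label"]
                if la is not None and lb is not None and la != lb:
                    continue  # conflicting supervision: skip this pair
                ca = vectors[clusters[a]["members"]].mean(axis=0)
                cb = vectors[clusters[b]["members"]].mean(axis=0)
                d = np.linalg.norm(ca - cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best is None:
            break  # no admissible merge left
        _, a, b = best
        merged = {"members": clusters[a]["members"] + clusters[b]["members"],
                  "label": clusters[a]["label"] or clusters[b]["label"]}
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters

assign = cluster_words(emb, k=2)
word_to_concept = {w: int(assign[i]) for i, w in enumerate(vocab)}

docs = ["team goal match", "goal team match market", "stock market bank",
        "bank stock market goal", "match team", "market bank"]
vectors = np.array([concept_weights(d.split(), word_to_concept, 2) for d in docs])

# Only two of the six documents are labeled.
labels = {0: "sports", 2: "finance"}
clusters = agglomerate(vectors, labels, n_clusters=2)
result = {c["label"]: sorted(c["members"]) for c in clusters}
print(result)
```

Each document vector here is directly readable as "fraction of words belonging to each concept", which is the kind of interpretability the abstract attributes to concept-based representations.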
Pages: 3861-3890
Page count: 30
Related papers
50 in total
  • [1] A semi-supervised framework for concept-based hierarchical document clustering
    Seyed Mojtaba Sadjadi
    Hoda Mashayekhi
    Hamid Hassanpour
    World Wide Web, 2023, 26 : 3861 - 3890
  • [2] Semi-supervised hierarchical clustering algorithms
    Amar, A
    Labzour, NT
    Bensaid, A
    SIXTH SCANDINAVIAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 1997, 40 : 232 - 239
  • [3] An active learning framework for semi-supervised document clustering with language modeling
    Huang, Ruizhang
    Lam, Wai
    DATA & KNOWLEDGE ENGINEERING, 2009, 68 (01) : 49 - 67
  • [4] A semi-supervised hierarchical ensemble clustering framework based on a novel similarity metric and stratified feature sampling
    Shi, Hui
    Peng, Qiang
    Xie, Zhiming
    Wang, Jian
    JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2023, 35 (08)
  • [5] A Framework for Semi-Supervised Clustering Based on Dimensionality Reduction
    Cui Peng
    Zhang Ru-bo
    FIRST INTERNATIONAL WORKSHOP ON DATABASE TECHNOLOGY AND APPLICATIONS, PROCEEDINGS, 2009, : 192 - +
  • [6] A nonnegative matrix factorization framework for semi-supervised document clustering with dual constraints
    Ma, Huifang
    Zhao, Weizhong
    Shi, Zhongzhi
    KNOWLEDGE AND INFORMATION SYSTEMS, 2013, 36 (03) : 629 - 651
  • [7] A nonnegative matrix factorization framework for semi-supervised document clustering with dual constraints
    Huifang Ma
    Weizhong Zhao
    Zhongzhi Shi
    Knowledge and Information Systems, 2013, 36 : 629 - 651
  • [8] Comparison of Semi-Supervised Hierarchical Clustering Using Clusterwise Tolerance
    Hamasuna, Yukihiro
    Endo, Yasunori
    JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS, 2012, 16 (07) : 819 - 824
  • [9] Semi-supervised Document Clustering Based on Latent Dirichlet Allocation (LDA)
    秦永彬
    李解
    黄瑞章
    李晶
    Journal of Donghua University (English Edition), 2016, 33 (05) : 685 - 688
  • [10] Semi-supervised model-based document clustering: A comparative study
    Shi Zhong
    Machine Learning, 2006, 65 : 3 - 29