A semi-supervised framework for concept-based hierarchical document clustering

Cited by: 0
Authors
Sadjadi, Seyed Mojtaba [1]
Mashayekhi, Hoda [1]
Hassanpour, Hamid [1]
Affiliations
[1] Shahrood Univ Technol, Fac Comp Engn, Shahrood, Iran
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2023 / Vol. 26 / Issue 06
Keywords
Semi-supervised clustering; Document clustering; Word embedding; Concept-based representation; Hierarchical clustering; ALGORITHM; WORDS;
DOI
10.1007/s11280-023-01209-4
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Text clustering is used in various applications of text analysis. In the clustering process, the employed document representation method has a significant impact on the results. Some popular document representation methods cannot effectively maintain the proximity information of the documents or suffer from low interpretability. Although the concept-based representation methods overcome these challenges to some extent, the existing semi-supervised document clustering methods rarely use this type of document representation. In this paper, we propose a concept-based semi-supervised framework for document clustering that uses both labeled and unlabeled data to yield a higher clustering quality. Concepts are composed of a set of semantically similar words. We propose the notion of semi-supervised concepts to benefit from document labels in extracting more relevant concepts. We also propose a new method of clustering documents based on the weights of such concepts. In the first and second steps of the proposed framework, the documents are represented based on the concepts extracted from the set of embedded words in the corpus. The proposed representation is interpretable and preserves the proximity information of documents. In the third step, the semi-supervised hierarchical clustering process utilizes unlabeled data to capture the overall structure of the clusters, and the supervision of a small number of labeled documents to adjust the cluster centroids. The use of concept vectors improves the process of merging clusters in the hierarchical clustering approach. The proposed framework is evaluated using the Reuters, 20-NewsGroups and WebKB text collections, and the results reveal the superiority of the proposed framework compared to several existing semi-supervised and unsupervised clustering approaches.
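The idea of a concept-based representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the greedy cosine-threshold grouping, the function names `extract_concepts` and `concept_vector`, the `threshold` parameter, and the toy 2-D embeddings are all assumptions made for illustration. It only shows how words with similar embeddings might be grouped into concepts and how a document could then be described by the weight of each concept.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extract_concepts(embeddings, threshold=0.8):
    """Hypothetical greedy grouping (an assumption, not the paper's method):
    a word joins the first concept whose seed word is similar enough,
    otherwise it starts a new concept."""
    concepts = []  # list of (seed_vector, member_words)
    for word, vec in embeddings.items():
        for seed, members in concepts:
            if cosine(seed, vec) >= threshold:
                members.append(word)
                break
        else:
            concepts.append((vec, [word]))
    return [members for _, members in concepts]


def concept_vector(tokens, concepts):
    """Weight of each concept = share of the document's known tokens in it."""
    word_to_concept = {w: i for i, members in enumerate(concepts) for w in members}
    counts = Counter(word_to_concept[t] for t in tokens if t in word_to_concept)
    total = sum(counts.values())
    return [counts[i] / total if total else 0.0 for i in range(len(concepts))]


# Toy 2-D word embeddings, invented for this example.
EMBEDDINGS = {
    "stock": (1.0, 0.1),
    "market": (0.9, 0.2),
    "goal": (0.1, 1.0),
    "match": (0.2, 0.9),
}

concepts = extract_concepts(EMBEDDINGS)
doc = ["stock", "market", "stock", "goal"]
weights = concept_vector(doc, concepts)
print(concepts)  # [['stock', 'market'], ['goal', 'match']]
print(weights)   # [0.75, 0.25]
```

Each dimension of the resulting vector corresponds to a readable group of words, which is the sense in which such a representation is interpretable. The semi-supervised concept extraction and the hierarchical merging step described in the abstract are not covered by this sketch.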
Pages: 3861 - 3890
Number of pages: 30
Related papers
50 in total
  • [31] A New semi-supervised clustering for incomplete data
    Goel, Sonia
    Tushir, Meena
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2022, 42 (02) : 727 - 739
  • [32] Semi-supervised clustering with inaccurate pairwise annotations
    Gribel, Daniel
    Gendreau, Michel
    Vidal, Thibaut
    INFORMATION SCIENCES, 2022, 607 : 441 - 457
  • [33] Two-stage semi-supervised clustering ensemble framework based on constraint weight
    Zhang, Ding
    Yang, Youlong
    Qiu, Haiquan
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2023, 14 (02) : 567 - 586
  • [34] Two-stage semi-supervised clustering ensemble framework based on constraint weight
    Ding Zhang
    Youlong Yang
    Haiquan Qiu
    International Journal of Machine Learning and Cybernetics, 2023, 14 : 567 - 586
  • [35] Hyper-Heuristic Framework for Sequential Semi-Supervised Classification Based on Core Clustering
    Adnan, Ahmed
    Muhammed, Abdullah
    Abd Ghani, Abdul Azim
    Abdullah, Azizol
    Hakim, Fahrul
    SYMMETRY-BASEL, 2020, 12 (08):
  • [36] Semi-supervised projected model-based clustering
    Guerra, Luis
    Bielza, Concha
    Robles, Victor
    Larranaga, Pedro
    DATA MINING AND KNOWLEDGE DISCOVERY, 2014, 28 (04) : 882 - 917
  • [37] Network anomaly detection based on semi-supervised clustering
    Wei Xiaotao
    Huang Houkuan
    Tian Shengfeng
    NEW ADVANCES IN SIMULATION, MODELLING AND OPTIMIZATION (SMO '07), 2007, : 440 - +
  • [38] A New Incremental Semi-Supervised Graph Based Clustering
    Vu Viet Thang
    Pashchenko, Fedor F.
    FIFTH INTERNATIONAL CONFERENCE ON ENGINEERING AND TELECOMMUNICATION (ENT-MIPT 2018), 2018, : 210 - 214
  • [39] Semi-Supervised Kernel-Based Temporal Clustering
    Araujo, Rodrigo
    Kamel, Mohamed S.
    2014 13TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2014, : 123 - 128
  • [40] Semi-supervised consensus clustering based on closed patterns
    Yang, Tianshu
    Pasquier, Nicolas
    Precioso, Frederic
    KNOWLEDGE-BASED SYSTEMS, 2022, 235