A semi-supervised framework for concept-based hierarchical document clustering

Cited by: 0
Authors
Sadjadi, Seyed Mojtaba [1]
Mashayekhi, Hoda [1]
Hassanpour, Hamid [1]
Affiliations
[1] Shahrood Univ Technol, Fac Comp Engn, Shahrood, Iran
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2023 / Vol. 26 / Issue 06
Keywords
Semi-supervised clustering; Document clustering; Word embedding; Concept-based representation; Hierarchical clustering; ALGORITHM; WORDS;
DOI
10.1007/s11280-023-01209-4
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
Text clustering is used in various applications of text analysis. In the clustering process, the employed document representation method has a significant impact on the results. Some popular document representation methods cannot effectively maintain the proximity information of the documents or suffer from low interpretability. Although the concept-based representation methods overcome these challenges to some extent, the existing semi-supervised document clustering methods rarely use this type of document representation. In this paper, we propose a concept-based semi-supervised framework for document clustering that uses both labeled and unlabeled data to yield a higher clustering quality. Concepts are composed of a set of semantically similar words. We propose the notion of semi-supervised concepts to benefit from document labels in extracting more relevant concepts. We also propose a new method of clustering documents based on the weights of such concepts. In the first and second steps of the proposed framework, the documents are represented based on the concepts extracted from the set of embedded words in the corpus. The proposed representation is interpretable and preserves the proximity information of documents. In the third step, the semi-supervised hierarchical clustering process utilizes unlabeled data to capture the overall structure of the clusters, and the supervision of a small number of labeled documents to adjust the cluster centroids. The use of concept vectors improves the process of merging clusters in the hierarchical clustering approach. The proposed framework is evaluated using the Reuters, 20-NewsGroups and WebKB text collections, and the results reveal the superiority of the proposed framework compared to several existing semi-supervised and unsupervised clustering approaches.
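The three steps named in the abstract (concept extraction from word embeddings, concept-weight document representation, and semi-supervised hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' algorithm: the record does not give its details, so every choice below is an assumption made for illustration. Concepts are approximated by plain k-means over toy word vectors, concept weights by normalized word counts, and supervision by seeding clusters with labeled documents and refusing to merge clusters that carry different labels. All function names (`extract_concepts`, `concept_vector`, `semi_supervised_agglomerative`) and the toy data are hypothetical.

```python
import numpy as np

def extract_concepts(word_vecs, k, n_iter=10):
    """Group word embeddings into k 'concepts' with plain k-means (assumed stand-in)."""
    # Deterministic farthest-point initialization of the k centers.
    centers = [word_vecs[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(word_vecs - c, axis=1) for c in centers], axis=0)
        centers.append(word_vecs[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(word_vecs[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = word_vecs[assign == c].mean(axis=0)
    return assign  # assign[w] is the concept id of word w

def concept_vector(doc_word_ids, assign, k):
    """Represent a non-empty document as the share of its words in each concept."""
    counts = np.bincount(assign[doc_word_ids], minlength=k).astype(float)
    return counts / counts.sum()

def semi_supervised_agglomerative(X, labels, n_clusters):
    """Centroid-linkage merging; labeled documents seed clusters (assumed scheme)."""
    seeds, groups = {}, []
    for i in range(len(X)):
        if i in labels:
            seeds.setdefault(labels[i], []).append(i)
        else:
            groups.append(([i], None))
    groups += [(members, lab) for lab, members in seeds.items()]
    while len(groups) > n_clusters:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                la, lb = groups[a][1], groups[b][1]
                if la is not None and lb is not None and la != lb:
                    continue  # never merge clusters with conflicting labels
                d = np.linalg.norm(X[groups[a][0]].mean(axis=0)
                                   - X[groups[b][0]].mean(axis=0))
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best is None:
            break  # only conflicting pairs remain
        _, a, b = best
        label = groups[a][1] if groups[a][1] is not None else groups[b][1]
        merged = (groups[a][0] + groups[b][0], label)
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [merged]
    out = np.empty(len(X), dtype=int)
    for cid, (members, _) in enumerate(groups):
        out[members] = cid
    return out

# Toy data (made up): six 2-D "word embeddings" forming two obvious groups.
word_vecs = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
docs = [[0, 1, 2], [0, 1, 2, 3], [3, 4, 5], [3, 4, 5, 0]]  # word ids per document
assign = extract_concepts(word_vecs, k=2)
X = np.array([concept_vector(d, assign, 2) for d in docs])
clusters = semi_supervised_agglomerative(X, labels={0: "a", 2: "b"}, n_clusters=2)
```

In this toy run only documents 0 and 2 are labeled; the unlabeled documents join the seeded cluster whose concept-weight centroid is nearest. Each dimension of `X` corresponds to one word group, which is the kind of interpretability the abstract attributes to concept-based representations.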
Pages: 3861 - 3890
Number of pages: 30
Related papers
50 items in total
  • [41] Semi-supervised clustering based on affinity propagation algorithm
    Xiao, Yu
    Yu, Jian
    Ruan Jian Xue Bao/Journal of Software, 2008, 19 (11): : 2803 - 2813
  • [42] Knowledge augmentation-based soft constraints for semi-supervised clustering
    Zhang, Zhanhu
    Yu, Xia
    Tao, Rui
    Zhang, Xinyu
    Li, Hongru
    Lu, Jingyi
    Zhou, Jian
    APPLIED SOFT COMPUTING, 2023, 144
  • [43] A New Incremental Semi-Supervised Graph Based Clustering
    Vu Viet Thang
    Pashchenko, Fedor F.
    FIFTH INTERNATIONAL CONFERENCE ON ENGINEERING AND TELECOMMUNICATION (ENT-MIPT 2018), 2018, : 210 - 214
  • [44] Hierarchical Clustering Using Transitive Closure and Semi-supervised Classification Based on Fuzzy Rough Approximation
    Miyamoto, Sadaaki
    Takumi, Satoshi
    2012 IEEE INTERNATIONAL CONFERENCE ON GRANULAR COMPUTING (GRC 2012), 2012, : 359 - 364
  • [45] Spectral clustering: A semi-supervised approach
    Chen, Weifu
    Feng, Guocan
    NEUROCOMPUTING, 2012, 77 (01) : 229 - 242
  • [46] Research Progress on Semi-Supervised Clustering
    Yue Qin
    Shifei Ding
    Lijuan Wang
    Yanru Wang
    Cognitive Computation, 2019, 11 : 599 - 612
  • [47] Hyperspectral Tissue Image Segmentation Using Semi-Supervised NMF and Hierarchical Clustering
    Kumar, Neeraj
    Uppala, Phanikrishna
    Duddu, Karthik
    Sreedhar, Had
    Varma, Vishal
    Guzman, Grace
    Walsh, Michael
    Sethi, Amit
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2019, 38 (05) : 1304 - 1313
  • [48] Semi-Supervised Clustering for Architectural Modularisation
    Feist, Sofia
    Sanhudo, Luis
    Esteves, Vitor
    Pires, Miguel
    Costa, Antonio Aguiar
    BUILDINGS, 2022, 12 (03)
  • [49] Semi-supervised Agglomerative Hierarchical Clustering with Ward Method Using Clusterwise Tolerance
    Hamasuna, Yukihiro
    Endo, Yasunori
    Miyamoto, Sadaaki
    MODELING DECISIONS FOR ARTIFICIAL INTELLIGENCE, MDAI 2011, 2011, 6820 : 103 - +
  • [50] Fast semi-supervised evidential clustering
    Antoine, Violaine
    Guerrero, Jose A.
    Xie, Jiarui
    INTERNATIONAL JOURNAL OF APPROXIMATE REASONING, 2021, 133 (133) : 116 - 132