A semi-supervised framework for concept-based hierarchical document clustering

Cited by: 1
Authors
Sadjadi, Seyed Mojtaba [1]
Mashayekhi, Hoda [1]
Hassanpour, Hamid [1]
Affiliation
[1] Shahrood Univ Technol, Fac Comp Engn, Shahrood, Iran
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2023 / Vol. 26 / Issue 06
Keywords
Semi-supervised clustering; Document clustering; Word embedding; Concept-based representation; Hierarchical clustering; ALGORITHM; WORDS
DOI
10.1007/s11280-023-01209-4
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text clustering is used in various applications of text analysis. In the clustering process, the employed document representation method has a significant impact on the results. Some popular document representation methods cannot effectively maintain the proximity information of the documents or suffer from low interpretability. Although the concept-based representation methods overcome these challenges to some extent, the existing semi-supervised document clustering methods rarely use this type of document representation. In this paper, we propose a concept-based semi-supervised framework for document clustering that uses both labeled and unlabeled data to yield a higher clustering quality. Concepts are composed of a set of semantically similar words. We propose the notion of semi-supervised concepts to benefit from document labels in extracting more relevant concepts. We also propose a new method of clustering documents based on the weights of such concepts. In the first and second steps of the proposed framework, the documents are represented based on the concepts extracted from the set of embedded words in the corpus. The proposed representation is interpretable and preserves the proximity information of documents. In the third step, the semi-supervised hierarchical clustering process utilizes unlabeled data to capture the overall structure of the clusters, and the supervision of a small number of labeled documents to adjust the cluster centroids. The use of concept vectors improves the process of merging clusters in the hierarchical clustering approach. The proposed framework is evaluated using the Reuters, 20-NewsGroups and WebKB text collections, and the results reveal the superiority of the proposed framework compared to several existing semi-supervised and unsupervised clustering approaches.
Pages: 3861-3890
Page count: 30
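Illustrative sketch
The abstract describes representing documents by the weights of concepts, where each concept is a set of semantically similar words drawn from the embedded vocabulary. The following is a minimal sketch of that general idea only, not the authors' algorithm: it assumes plain k-means over word vectors as the concept extractor, uses made-up 2-D embeddings, and omits the label supervision and the hierarchical clustering step entirely. The names `EMBEDDINGS`, `kmeans`, `extract_concepts`, and `concept_vector` are hypothetical.

```python
import math
from collections import Counter

# Toy 2-D word embeddings (made-up values, for illustration only).
EMBEDDINGS = {
    "goal": (0.9, 0.1), "match": (0.8, 0.2), "team": (0.85, 0.15),
    "stock": (0.1, 0.9), "market": (0.2, 0.8), "bank": (0.15, 0.85),
}


def kmeans(points, centroids, iterations=10):
    """Plain k-means; returns one cluster index per point."""
    centroids = list(centroids)  # work on a copy
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def extract_concepts(embeddings, seed_words):
    """Group words into concepts (assumption: one k-means cluster = one concept)."""
    words = list(embeddings)
    points = [embeddings[w] for w in words]
    seeds = [embeddings[s] for s in seed_words]
    return dict(zip(words, kmeans(points, seeds)))


def concept_vector(tokens, word_to_concept, n_concepts):
    """Represent a document as the share of its known words in each concept."""
    counts = Counter(word_to_concept[t] for t in tokens if t in word_to_concept)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in range(n_concepts)]
```

Calling `extract_concepts(EMBEDDINGS, ["goal", "stock"])` and then `concept_vector(tokens, word_to_concept, 2)` yields one weight per concept for a tokenized document. Each dimension maps back to a readable word group, which is the kind of interpretability the abstract attributes to concept-based representations.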