Image clustering using generated text centroids

Cited: 3
Authors
Kong, Daehyeon [1 ,2 ]
Kong, Kyeongbo [3 ]
Kang, Suk-Ju [2 ]
Affiliations
[1] NAVER, Seongnam 13561, South Korea
[2] Sogang Univ, Seoul 04107, South Korea
[3] Pusan Natl Univ, Busan 46241, South Korea
Keywords
Deep neural network; Image clustering; Multimodal task; Vision-language model;
DOI
10.1016/j.image.2024.117128
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline codes
0808; 0809;
Abstract
In recent years, deep neural networks pretrained on large-scale datasets have been used to address data deficiency and achieve better performance through prior knowledge. Contrastive language-image pretraining (CLIP), a vision-language model pretrained on an extensive dataset, achieves strong performance in image recognition. In this study, we harness the power of multimodality for image clustering, shifting from a single-modality to a multimodal framework by exploiting the describability of the CLIP image encoder. The importance of this shift lies in the ability of multimodality to provide richer feature representations. By generating text centroids corresponding to the image features, we effectively create a common descriptive language for each cluster. The text centroids are learned using the results of a standard clustering algorithm as pseudo-labels, so that each centroid captures a common description of its cluster. Although the only change is that image features in the shared embedding space are assigned to text centroids, clustering performance improves significantly over the standard clustering algorithm, especially on complex datasets. With the proposed method, the normalized mutual information score rises by 32% on the Stanford40 dataset and by 64% on ImageNet-Dog compared with k-means clustering.
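The pipeline the abstract describes — cluster image features with a standard algorithm, learn one "text centroid" per cluster from those pseudo-labels, then reassign features to the centroids by cosine similarity in the shared embedding space — can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: random Gaussian blobs stand in for CLIP image embeddings, and the gradient step is an approximation taken with respect to the normalized centroids.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    idx = [0]
    for _ in range(k - 1):
        d = np.min(((X[:, None] - X[idx][None]) ** 2).sum(-1), axis=1)
        idx.append(int(d.argmax()))
    C = X[idx].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(0)
    return labels, C

def learn_text_centroids(img_feats, pseudo_labels, k, lr=0.5, steps=200, seed=0):
    """Learn one embedding per cluster (a 'text centroid') by minimizing
    cross-entropy between cosine-similarity logits and the k-means
    pseudo-labels; the gradient is approximated wrt the normalized centroids."""
    rng = np.random.default_rng(seed)
    T = rng.normal(size=(k, img_feats.shape[1]))
    X = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    onehot = np.eye(k)[pseudo_labels]
    for _ in range(steps):
        Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
        logits = X @ Tn.T                                  # cosine similarities
        p = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
        T += lr * (onehot - p).T @ X / len(X)              # cross-entropy step
    return T / np.linalg.norm(T, axis=1, keepdims=True)

# Stand-in for CLIP image features: two well-separated blobs.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(5.0, 1.0, size=(20, 8)),
                   rng.normal(-5.0, 1.0, size=(20, 8))])
pseudo, _ = kmeans(feats, 2)                    # step 1: standard clustering
text_centroids = learn_text_centroids(feats, pseudo, 2)   # step 2: centroids
feats_n = feats / np.linalg.norm(feats, axis=1, keepdims=True)
final_labels = (feats_n @ text_centroids.T).argmax(1)     # step 3: reassign
```

In the paper the centroids live in CLIP's joint text-image space and are optimized through the text side of the model; here plain learnable vectors play that role, which is enough to show why reassigning to learned centroids can differ from raw k-means assignment.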
Pages: 7
Related papers
35 records
[1]   Deep Clustering for Unsupervised Learning of Visual Features [J].
Caron, Mathilde ;
Bojanowski, Piotr ;
Joulin, Armand ;
Douze, Matthijs .
COMPUTER VISION - ECCV 2018, PT XIV, 2018, 11218 :139-156
[2]  
Chang JL, 2017, IEEE I CONF COMP VIS, P5880, DOI [10.1109/ICCV.2017.626, 10.1109/ICCV.2017.627]
[3]  
Chen T, 2020, PR MACH LEARN RES, V119
[4]  
Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848
[5]  
Dosovitskiy A, 2021, Arxiv, DOI arXiv:2010.11929
[6]  
Grill J. B., 2020, Advances in Neural Information Processing Systems
[7]  
Guérin J, 2018, Arxiv, DOI arXiv:1804.04572
[8]   Combining pretrained CNN feature extractors to enhance clustering of complex natural images [J].
Guerin, Joris ;
Thiery, Stephane ;
Nyiri, Eric ;
Gibaru, Olivier ;
Boots, Byron .
NEUROCOMPUTING, 2021, 423 :551-571
[9]   Deep Residual Learning for Image Recognition [J].
He, Kaiming ;
Zhang, Xiangyu ;
Ren, Shaoqing ;
Sun, Jian .
2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, :770-778
[10]   CNN-Based Joint Clustering and Representation Learning with Feature Drift Compensation for Large-Scale Image Data [J].
Hsu, Chih-Chung ;
Lin, Chia-Wen .
IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (02) :421-429