When CLIP meets cross-modal hashing retrieval: A new strong baseline

Cited by: 36
Authors
Xia, Xinyu [1 ,2 ]
Dong, Guohua [1 ]
Li, Fengling [3 ]
Zhu, Lei [2 ]
Ying, Xiaomin [1 ]
Affiliations
[1] Beijing Inst Basic Med Sci, Ctr Computat Biol, Beijing 100850, Peoples R China
[2] Shandong Normal Univ, Sch Informat Sci & Engn, Jinan 250358, Peoples R China
[3] Univ Technol Sydney, Fac Engn & Informat Technol, Australian Artificial Intelligence Inst, Ultimo, NSW 2007, Australia
Keywords
Cross-modal retrieval; Hashing; CLIP; Modality fusion; Contrastive learning;
DOI
10.1016/j.inffus.2023.101968
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent years have witnessed significant progress on various multi-modal tasks driven by Contrastive Language-Image Pre-training (CLIP), a large-scale multi-modal model that learns visual representations from natural language supervision. However, the potential effects of CLIP on cross-modal hashing retrieval have not yet been investigated. In this paper, we explore for the first time the effects of CLIP on cross-modal hashing retrieval performance and propose a simple but strong baseline, the Unsupervised Contrastive Multi-modal Fusion Hashing network (UCMFH). We first extract off-the-shelf visual and linguistic features from the CLIP model as the input sources for the cross-modal hashing functions. To further mitigate the semantic gap between the image and text features, we design an effective contrastive multi-modal learning module that leverages a multi-modal fusion transformer encoder supervised by a contrastive loss, enhancing modality interaction while improving the semantic representation of each modality. Furthermore, we design a contrastive hash learning module to produce high-quality modality-correlated hash codes. Experiments show that our simple new unsupervised baseline UCMFH achieves significant performance improvements over state-of-the-art supervised and unsupervised cross-modal hashing methods. Our experiments also demonstrate the remarkable performance of CLIP features on the cross-modal hashing retrieval task compared to the deep visual and linguistic features used in existing state-of-the-art methods. The source code for our approach is publicly available at: https://github.com/XinyuXia97/UCMFH.
Pages: 12
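
The following is a minimal PyTorch sketch of the pipeline the abstract outlines: frozen CLIP image and text features are fused by a transformer encoder and trained with a symmetric contrastive (InfoNCE-style) loss before being relaxed into hash codes. It is an illustrative assumption of the design, not the authors' released implementation (see the GitHub repository above); the names FusionHashNet and info_nce, the 512-d feature size, and the 64-bit code length are all hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHashNet(nn.Module):
    """Hypothetical fusion + hashing head on top of frozen CLIP features."""
    def __init__(self, feat_dim=512, code_len=64, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.img_head = nn.Linear(feat_dim, code_len)
        self.txt_head = nn.Linear(feat_dim, code_len)

    def forward(self, img_feat, txt_feat):
        # Treat the two modality features as a length-2 token sequence so
        # self-attention models the cross-modal interaction.
        fused = self.fusion(torch.stack([img_feat, txt_feat], dim=1))  # (B, 2, D)
        # tanh relaxes the binary constraint; sign() binarizes at retrieval time.
        return (torch.tanh(self.img_head(fused[:, 0])),
                torch.tanh(self.txt_head(fused[:, 1])))

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs in a batch are positives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Usage with random stand-ins for the frozen 512-d CLIP features that would
# come from clip.load("ViT-B/32") -> encode_image / encode_text.
net = FusionHashNet()
h_img, h_txt = net(torch.randn(32, 512), torch.randn(32, 512))
loss = info_nce(h_img, h_txt)  # contrastive hash-learning term
loss.backward()

At retrieval time the continuous outputs would be binarized with torch.sign() and compared by Hamming distance, which is the standard practice in cross-modal hashing.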