When CLIP meets cross-modal hashing retrieval: A new strong baseline

Cited by: 36
Authors
Xia, Xinyu [1 ,2 ]
Dong, Guohua [1 ]
Li, Fengling [3 ]
Zhu, Lei [2 ]
Ying, Xiaomin [1 ]
Affiliations
[1] Beijing Inst Basic Med Sci, Ctr Computat Biol, Beijing 100850, Peoples R China
[2] Shandong Normal Univ, Sch Informat Sci & Engn, Jinan 250358, Peoples R China
[3] Univ Technol Sydney, Fac Engn & Informat Technol, Australian Artificial Intelligence Inst, Ultimo, NSW 2007, Australia
Keywords
Cross-modal retrieval; Hashing; CLIP; Modality fusion; Contrastive learning;
DOI
10.1016/j.inffus.2023.101968
CLC Number
TP18 [Artificial intelligence theory];
Subject Classification Numbers
081104; 0812; 0835; 1405
Abstract
Recent years have witnessed significant progress on various multi-modal tasks driven by Contrastive Language-Image Pre-training (CLIP), a large-scale multi-modal model that learns visual representations from natural language supervision. However, the potential effects of CLIP on cross-modal hashing retrieval have not yet been investigated. In this paper, we explore for the first time the effects of CLIP on cross-modal hashing retrieval performance and propose a simple but strong baseline, the Unsupervised Contrastive Multi-modal Fusion Hashing network (UCMFH). We first extract off-the-shelf visual and linguistic features from the CLIP model as the input sources for the cross-modal hashing functions. To further mitigate the semantic gap between the image and text features, we design an effective contrastive multi-modal learning module that leverages a multi-modal fusion transformer encoder supervised by a contrastive loss, enhancing modality interaction while improving the semantic representation of each modality. Furthermore, we design a contrastive hash learning module to produce high-quality, modality-correlated hash codes. Experiments show that our simple new unsupervised baseline UCMFH achieves significant performance improvements over state-of-the-art supervised and unsupervised cross-modal hashing methods. Our experiments also demonstrate the remarkable performance of CLIP features on the cross-modal hashing retrieval task compared to the deep visual and linguistic features used in existing state-of-the-art methods. The source code for our approach is publicly available at: https://github.com/XinyuXia97/UCMFH.
Pages: 12
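
The abstract above describes a three-stage pipeline: off-the-shelf CLIP feature extraction, a contrastive multi-modal fusion transformer, and contrastive hash learning. Below is a minimal PyTorch sketch of that pipeline for orientation; the layer sizes, hash code length, temperature, and module names (FusionEncoder, HashHead) are illustrative assumptions, not the authors' released implementation (see the GitHub link in the abstract for that).

```python
# Minimal sketch of the UCMFH pipeline as described in the abstract.
# Layer sizes, code length, and module names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: off-the-shelf CLIP features as input to the hashing network.
clip_model, preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()

@torch.no_grad()
def extract_features(images, texts):
    """images: preprocessed tensor [N, 3, 224, 224]; texts: list of N strings."""
    img_feat = clip_model.encode_image(images.to(device)).float()
    txt_feat = clip_model.encode_text(clip.tokenize(texts, truncate=True).to(device)).float()
    return img_feat, txt_feat  # both [N, 512] for ViT-B/32

# Stage 2: multi-modal fusion transformer encoder for modality interaction.
class FusionEncoder(nn.Module):
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, img_feat, txt_feat):
        tokens = torch.stack([img_feat, txt_feat], dim=1)  # [N, 2, dim]
        fused = self.encoder(tokens)                        # cross-modal attention
        return fused[:, 0], fused[:, 1]                     # enhanced per-modality features

# Stage 3: hash head producing K-bit codes (tanh relaxation during training).
class HashHead(nn.Module):
    def __init__(self, dim=512, bits=64):
        super().__init__()
        self.fc = nn.Linear(dim, bits)

    def forward(self, x):
        return torch.tanh(self.fc(x))  # binarized with sign(.) at retrieval time

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs in a batch are positives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```

In such a setup, one contrastive loss would supervise the fused features and another the relaxed hash codes, mirroring the two contrastive modules named in the abstract; at retrieval time the tanh outputs are binarized with torch.sign and compared by Hamming distance.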