When CLIP meets cross-modal hashing retrieval: A new strong baseline

Cited by: 36
Authors
Xia, Xinyu [1 ,2 ]
Dong, Guohua [1 ]
Li, Fengling [3 ]
Zhu, Lei [2 ]
Ying, Xiaomin [1 ]
Affiliations
[1] Beijing Inst Basic Med Sci, Ctr Computat Biol, Beijing 100850, Peoples R China
[2] Shandong Normal Univ, Sch Informat Sci & Engn, Jinan 250358, Peoples R China
[3] Univ Technol Sydney, Fac Engn & Informat Technol, Australian Artificial Intelligence Inst, Ultimo, NSW 2007, Australia
Keywords
Cross-modal retrieval; Hashing; CLIP; Modality fusion; Contrastive learning;
DOI
10.1016/j.inffus.2023.101968
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent years have witnessed significant progress on various multi-modal tasks driven by Contrastive Language-Image Pre-training (CLIP), a large-scale multi-modal model that learns visual representations from natural language supervision. However, the potential effects of CLIP on cross-modal hashing retrieval have not yet been investigated. In this paper, we explore for the first time the effects of CLIP on cross-modal hashing retrieval performance and propose a simple but strong baseline, the Unsupervised Contrastive Multi-modal Fusion Hashing network (UCMFH). We first extract off-the-shelf visual and linguistic features from the CLIP model as the input sources for the cross-modal hashing functions. To further mitigate the semantic gap between the image and text features, we design an effective contrastive multi-modal learning module that leverages a multi-modal fusion transformer encoder supervised by a contrastive loss, enhancing modality interaction while improving the semantic representation of each modality. Furthermore, we design a contrastive hash learning module to produce high-quality modality-correlated hash codes. Experiments show that our simple new unsupervised baseline UCMFH achieves significant performance improvements over state-of-the-art supervised and unsupervised cross-modal hashing methods. Our experiments also demonstrate the remarkable performance of CLIP features on the cross-modal hashing retrieval task compared to the deep visual and linguistic features used in existing state-of-the-art methods. The source code for our approach is publicly available at: https://github.com/XinyuXia97/UCMFH.
Pages: 12
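
The following is a minimal PyTorch sketch of the pipeline the abstract outlines: frozen CLIP image and text features are fused by a transformer encoder and trained with a symmetric contrastive (InfoNCE-style) loss before being relaxed into hash codes. It is an illustrative assumption of the design, not the authors' released implementation (see the GitHub repository above); the names FusionHashNet and info_nce, the 512-d feature size, and the 64-bit code length are all hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHashNet(nn.Module):
    """Hypothetical fusion + hashing head on top of frozen CLIP features."""
    def __init__(self, feat_dim=512, code_len=64, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.img_head = nn.Linear(feat_dim, code_len)
        self.txt_head = nn.Linear(feat_dim, code_len)

    def forward(self, img_feat, txt_feat):
        # Treat the two modality features as a length-2 token sequence so
        # self-attention models the cross-modal interaction.
        fused = self.fusion(torch.stack([img_feat, txt_feat], dim=1))  # (B, 2, D)
        # tanh relaxes the binary constraint; sign() binarizes at retrieval time.
        return (torch.tanh(self.img_head(fused[:, 0])),
                torch.tanh(self.txt_head(fused[:, 1])))

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs in a batch are positives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Usage with random stand-ins for the frozen 512-d CLIP features that would
# come from clip.load("ViT-B/32") -> encode_image / encode_text.
net = FusionHashNet()
h_img, h_txt = net(torch.randn(32, 512), torch.randn(32, 512))
loss = info_nce(h_img, h_txt)  # contrastive hash-learning term
loss.backward()

At retrieval time the continuous outputs would be binarized with torch.sign() and compared by Hamming distance, which is the standard practice in cross-modal hashing.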