HashFormer: Vision Transformer Based Deep Hashing for Image Retrieval

Cited by: 53
Authors
Li, Tao [1 ]
Zhang, Zheng [1 ]
Pei, Lishen [2 ]
Gan, Yan [3 ]
Affiliations
[1] Open Univ Henan, Zhengzhou 450046, Peoples R China
[2] Henan Univ Econ & Law, Zhengzhou 450046, Peoples R China
[3] Chongqing Univ, Chongqing 400044, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
Transformers; Binary codes; Task analysis; Training; Image retrieval; Feature extraction; Databases; Binary embedding; image retrieval;
DOI
10.1109/LSP.2022.3157517
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
Deep image hashing aims to map an input image to compact binary codes via a deep neural network, enabling efficient image retrieval over large-scale datasets. Owing to the explosive growth of modern data, deep hashing has gained growing attention from the research community. Recently, convolutional neural networks (CNNs) such as ResNet have dominated deep hashing. Nevertheless, motivated by recent advances in vision transformers, we propose a pure transformer-based framework, called HashFormer, to tackle the deep hashing task. Specifically, we adopt a vision transformer (ViT) as our backbone and treat binary codes as intermediate representations for a surrogate task, i.e., image classification. In addition, we observe that binary codes well suited for classification are sub-optimal for retrieval. To mitigate this problem, we present a novel average precision loss, which enables us to directly optimize retrieval accuracy. To the best of our knowledge, our work is among the first to address deep hashing without CNNs. We perform comprehensive experiments on three widely studied datasets: CIFAR-10, NUS-WIDE, and ImageNet. The proposed method demonstrates promising results against existing state-of-the-art works, validating the advantages and merits of HashFormer.
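To make the pipeline described in the abstract concrete, the sketch below illustrates the generic deep-hashing inference step it assumes: backbone features are projected to a short code, binarized with sign(), and database items are ranked by Hamming distance. The random features standing in for ViT embeddings, the single linear hash layer, the tanh relaxation, and all shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for ViT [CLS] embeddings (hypothetical shapes; the paper's
# actual backbone is a vision transformer, not reproduced here).
feat_dim, code_len, n_db, n_query = 768, 32, 100, 3
db_feats = rng.normal(size=(n_db, feat_dim))
q_feats = rng.normal(size=(n_query, feat_dim))

# Hash head: one linear projection, with tanh as the usual smooth
# relaxation of sign() used during training.
W = rng.normal(scale=0.02, size=(feat_dim, code_len))

def hash_codes(x):
    relaxed = np.tanh(x @ W)                 # continuous codes in (-1, 1)
    return np.where(relaxed >= 0, 1, -1)     # binarized {-1, +1} codes

db_codes = hash_codes(db_feats)
q_codes = hash_codes(q_feats)

# For {-1, +1} codes, Hamming distance = (L - <b1, b2>) / 2,
# so the whole distance matrix is one matrix multiply.
def hamming(a, b):
    return (code_len - a @ b.T) // 2

d = hamming(q_codes, db_codes)   # (n_query, n_db) distance matrix
ranking = np.argsort(d, axis=1)  # nearest database items first
```

This inner-product trick is why binary codes make retrieval cheap: ranking the database reduces to an integer matrix multiply and a sort, with no floating-point distance computation at query time.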
Pages: 827-831
Page count: 5