Deep Supervised Hashing Image Retrieval Method Based on Swin Transformer

Cited by: 0
Authors
Miao Z. [1]
Zhao X. [1]
Li Y. [1]
Wang J. [1]
Zhang R. [1]
Affiliations
[1] Command and Control Engineering College, Army Engineering University of PLA, Nanjing
Source
Hunan Daxue Xuebao/Journal of Hunan University Natural Sciences | 2023, Vol. 50, No. 8
Funding
National Natural Science Foundation of China;
Keywords
deep learning; hash learning; image retrieval; Swin Transformer;
DOI
10.16339/j.cnki.hdxbzkb.2023274
Abstract
Feature extraction in deep supervised hashing for image retrieval has long been dominated by convolutional neural network (CNN) architectures. With the adoption of the Transformer in computer vision, however, it has become feasible to replace CNN backbones with Transformer-based ones. To address the limitations of existing Transformer-based hashing methods, namely their inability to generate hierarchical representations and their high computational complexity, a deep supervised hashing method for image retrieval based on the Swin Transformer is proposed. The method adopts the Swin Transformer as its backbone network and appends a hash layer at the end of the network to generate binary hash codes for images. By introducing locality and hierarchy into the model, the method effectively overcomes both limitations. Compared with 13 existing state-of-the-art methods, the proposed method substantially improves hash retrieval performance. Experiments are conducted on two widely used retrieval datasets, CIFAR-10 and NUS-WIDE. The results show that the proposed method achieves a best mean average precision (mAP) of 98.4% on CIFAR-10, an average improvement of 7.1% over the TransHash method and 0.57% over the VTS16-CSQ method. On NUS-WIDE, it achieves a best mAP of 93.6%, an average improvement of 18.61% over TransHash and an average gain of 8.6% in retrieval accuracy over VTS16-CSQ. © 2023 Hunan University. All rights reserved.
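The architecture the abstract describes (a Swin Transformer backbone followed by a hash layer) can be sketched as below. This is a minimal illustrative sketch, not the authors' released implementation: it assumes the timm library for a pretrained Swin backbone, and the class name SwinHash, the backbone variant, and the 64-bit code length are hypothetical choices for demonstration. Training would additionally require a supervised hashing loss (e.g., a pairwise-similarity or central-similarity objective), which is omitted here.

import torch
import torch.nn as nn
import timm  # assumed dependency; ships pretrained Swin Transformer backbones

class SwinHash(nn.Module):
    """Swin Transformer backbone with a trailing hash layer (illustrative sketch)."""

    def __init__(self, code_length=64, backbone="swin_tiny_patch4_window7_224"):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of class logits
        self.backbone = timm.create_model(backbone, pretrained=True, num_classes=0)
        # Hash layer appended at the end of the network, as the abstract describes:
        # a linear projection to `code_length` dimensions, squashed into (-1, 1)
        self.hash_layer = nn.Sequential(
            nn.Linear(self.backbone.num_features, code_length),
            nn.Tanh(),
        )

    def forward(self, x):
        # Continuous relaxation of the binary codes, used during training
        return self.hash_layer(self.backbone(x))

    @torch.no_grad()
    def encode(self, x):
        # Binarize with sign() at indexing/retrieval time -> {-1, +1} codes
        return torch.sign(self.forward(x))

# Usage: encode a dummy batch of 224x224 images into 64-bit codes
model = SwinHash(code_length=64).eval()
codes = model.encode(torch.randn(2, 3, 224, 224))
print(codes.shape)  # torch.Size([2, 64])

The tanh relaxation followed by sign() at retrieval time is a common pattern in deep supervised hashing; whether this paper uses exactly that relaxation is not stated in the abstract.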
Pages: 62-71
Page count: 9
References (44 in total)
[31]  
BROWN T B, MANN B, RYDER N, et al., Language models are few-shot learners [C], Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 1877-1901, (2020)
[32]  
DEVLIN J, CHANG M W, LEE K, et al., BERT: pre-training of deep bidirectional transformers for language understanding [C], Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171-4186, (2019)
[33]  
DUBEY S R, SINGH S K, CHU W T., Vision transformer hashing for image retrieval [C], 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6, (2022)
[34]  
LIU Z, LIN Y T, CAO Y, et al., Swin Transformer: hierarchical vision transformer using shifted windows [C], 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992-10002, (2021)
[35]  
CHU X, TIAN Z, WANG Y, et al., Twins: revisiting the design of spatial attention in vision transformers, (2021)
[36]  
CHEN Y B, ZHANG S, LIU F X, et al., TransHash: transformer-based Hamming hashing for efficient image retrieval [C], Proceedings of the 2022 International Conference on Multimedia Retrieval, (2022)
[37]  
CHUA T S, TANG J H, HONG R C, et al., NUS-WIDE: a real-world web image database from National University of Singapore [C], Proceedings of the ACM International Conference on Image and Video Retrieval, pp. 1-9, (2009)
[38]  
LIU W, WANG J, JI R R, et al., Supervised hashing with kernels [C], 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2074-2081, (2012)
[39]  
KRIZHEVSKY A, HINTON G., Learning multiple layers of features from tiny images, Technical Report, University of Toronto, (2009)
[40]  
GONG Y C, LAZEBNIK S., Iterative quantization: a procrustean approach to learning binary codes [C], 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 817-824, (2011)