Few-Shot Image Classification Method Based on Visual Language Prompt Learning

Authors
Li B. [1]
Wang X. [1]
Teng S. [1]
Lyu X. [1]
Affiliations
[1] Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science & Technology University, Beijing
Keywords
few-shot learning; image classification; pretrained model; prompt learning; visual-language model
DOI
10.13190/j.jbupt.2023-014
Abstract
In order to improve the performance and generalization ability of few-shot image classification, a method is proposed that makes full use of a large-scale vision-language pre-trained model to classify images from only a few samples. First, multiple learnable text prompts are integrated into the text encoding part, in order to fully explore how the position of the image category label within the prompt statement affects the model's generalization performance. Second, a learnable visual prompt is added to the image encoding part so that the pre-trained image parameters better represent few-sample images. Finally, a feature adapter is attached to both the image and text feature encoders, and the network is fine-tuned on the image classification datasets so that it performs better on few-shot image classification. Extensive experiments on 10 publicly available datasets show that, compared with existing methods, the approach improves average accuracy by 2.9% in one-shot classification. © 2024 Beijing University of Posts and Telecommunications. All rights reserved.
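For concreteness, below is a minimal PyTorch-style sketch of the three components the abstract describes: learnable text prompts with a configurable class-label position, a learnable visual prompt prepended to the image patch sequence, and a residual feature adapter. It assumes a CLIP-like backbone with a transformer text encoder and a ViT image encoder; all class names, dimensions, and the blend ratio alpha are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableTextPrompt(nn.Module):
    # CoOp-style learnable context tokens; the class-name embedding can be
    # placed at the front, middle, or end of the prompt to study how the
    # label position affects generalization (hypothetical interface).
    def __init__(self, n_ctx: int, dim: int, position: str = "end"):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable context
        self.position = position

    def forward(self, cls_emb: torch.Tensor) -> torch.Tensor:
        # cls_emb: (n_cls, n_name_tokens, dim) token embeddings of the class names
        ctx = self.ctx.unsqueeze(0).expand(cls_emb.size(0), -1, -1)
        if self.position == "end":    # "[ctx] ... [ctx] [CLASS]"
            return torch.cat([ctx, cls_emb], dim=1)
        if self.position == "front":  # "[CLASS] [ctx] ... [ctx]"
            return torch.cat([cls_emb, ctx], dim=1)
        half = ctx.size(1) // 2       # "[ctx] [CLASS] [ctx]"
        return torch.cat([ctx[:, :half], cls_emb, ctx[:, half:]], dim=1)


class VisualPrompt(nn.Module):
    # VPT-style learnable tokens prepended to the patch sequence of a frozen ViT.
    def __init__(self, n_tokens: int, dim: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, n_patches, dim)
        vp = self.tokens.unsqueeze(0).expand(patches.size(0), -1, -1)
        return torch.cat([vp, patches], dim=1)


class FeatureAdapter(nn.Module):
    # Bottleneck adapter blended residually with the original encoder feature;
    # the reduction factor and blend ratio alpha are assumed values.
    def __init__(self, dim: int, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )
        self.alpha = alpha

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.alpha * self.net(feat) + (1.0 - self.alpha) * feat


def classify(img_feat: torch.Tensor, txt_feat: torch.Tensor,
             logit_scale: float = 100.0) -> torch.Tensor:
    # Cosine-similarity logits between adapted image and text features.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    return logit_scale * img_feat @ txt_feat.t()

Under a setup like this, only the prompt tokens and adapter weights would typically be trained while the pre-trained encoders stay frozen, which is consistent with the parameter-efficient, few-shot fine-tuning the abstract describes.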
Pages: 11-17
Number of pages: 6