Texts as Images in Prompt Tuning for Multi-Label Image Recognition

Cited by: 28
Authors
Guo, Zixian [1, 2]
Dong, Bowen [1]
Ji, Zhilong [2]
Bai, Jinfeng [2]
Guo, Yiwen
Zuo, Wangmeng [1, 2]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Tomorrow Adv Life, Beijing, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
DOI
10.1109/CVPR52729.2023.00275
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained models (e.g., CLIP) to various downstream tasks in data-limited or label-limited settings. Nonetheless, visual data (e.g., images) is by default a prerequisite for learning prompts in existing methods. In this work, we advocate that the effectiveness of image-text contrastive learning in aligning the two modalities (for training CLIP) further makes it feasible to treat texts as images for prompt tuning, and we introduce TaI prompting. In contrast to visual data, text descriptions are easy to collect, and their class labels can be directly derived. In particular, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning. Moreover, with TaI, double-grained prompt tuning (TaI-DPT) is further presented to extract both coarse-grained and fine-grained embeddings for enhancing multi-label recognition performance. Experimental results show that our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE, while it can be combined with existing methods of prompting from images to improve recognition performance further. The code is released at https://github.com/guozix/TaI-DPT.
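The abstract notes that class labels can be derived directly from text descriptions. The following is a minimal sketch of that one step only, assuming plain lowercase word matching against a small hand-written synonym table. The class names, the synonyms, and the helper `caption_to_labels` are hypothetical illustrations and are not taken from the paper or its released code, whose actual label-derivation pipeline may differ.

```python
import re

# Hypothetical class vocabulary with illustrative synonyms (an assumption,
# not the label set or synonym list used by the authors).
CLASS_SYNONYMS = {
    "person": {"person", "people", "man", "woman", "child"},
    "dog": {"dog", "dogs", "puppy"},
    "bicycle": {"bicycle", "bicycles", "bike", "bikes"},
}


def caption_to_labels(caption, class_synonyms=CLASS_SYNONYMS):
    """Return a multi-hot list marking which classes the caption mentions."""
    # Lowercase the caption and keep alphabetic tokens only.
    words = set(re.findall(r"[a-z]+", caption.lower()))
    # A class is marked present if any of its synonyms occurs as a word.
    return [int(bool(words & synonyms)) for synonyms in class_synonyms.values()]


labels = caption_to_labels("A man walks his dog on the beach.")
print(dict(zip(CLASS_SYNONYMS, labels)))
```

A sentence labeled this way could then stand in for a labeled image during prompt tuning, which is the substitution the abstract describes; the tuning step itself is not sketched here.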
Pages: 2808-2817
Page count: 10