TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning

Cited by: 37
Authors
Chen, Shiming [1 ]
Hong, Ziming [1 ]
Hou, Wenjin [1 ]
Xie, Guo-Sen [2 ]
Song, Yibing [3 ]
Zhao, Jian [4 ,5 ]
You, Xinge [1 ]
Yan, Shuicheng [6 ]
Shao, Ling [7 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan 430074, Peoples R China
[2] Nanjing Univ Sci & Technol, Nanjing 210094, Peoples R China
[3] Fudan Univ, AI3 Inst, Shanghai 200437, Peoples R China
[4] Inst North Elect Equipment, Beijing 100190, Peoples R China
[5] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[6] Sea AI Lab, Singapore 138522, Singapore
[7] Terminus Grp, Beijing 100027, Peoples R China
Keywords
Visualization; Semantics; Transformers; Federated learning; Location awareness; Knowledge transfer; Task analysis; Attribute localization; semantic-augmented visual embedding; semantical collaborative learning; transformer; zero-shot learning;
DOI
10.1109/TPAMI.2022.3229526
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Zero-shot learning (ZSL) tackles the novel-class recognition problem by transferring semantic knowledge from seen classes to unseen ones. Semantic knowledge is typically represented by attribute descriptions shared between classes, which act as strong priors for localizing the object attributes that correspond to discriminative region features, enabling significant and sufficient visual-semantic interaction for advancing ZSL. Existing attention-based models learn only inferior region features from a single image because they rely solely on unidirectional attention, ignoring the transferable and discriminative attribute localization of visual features that represents the key semantic knowledge for effective knowledge transfer in ZSL. In this paper, we propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for key semantic knowledge representations in ZSL. Specifically, TransZero++ employs an attribute→visual Transformer sub-net (AVT) and a visual→attribute Transformer sub-net (VAT) to learn attribute-based visual features and visual-based attribute features, respectively. By further introducing feature-level and prediction-level semantical collaborative losses, the two attribute-guided transformers teach each other to learn semantic-augmented visual embeddings for key semantic knowledge representations via semantical collaborative learning. Finally, the semantic-augmented visual embeddings learned by AVT and VAT are fused to conduct desirable visual-semantic interaction in cooperation with class semantic vectors for ZSL classification. Extensive experiments show that TransZero++ achieves new state-of-the-art results on three golden ZSL benchmarks and on the large-scale ImageNet dataset.
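
The abstract names the two cross-attention sub-nets (AVT and VAT) and the two collaborative losses but is necessarily dense. The minimal PyTorch sketch below illustrates that structure. Only the AVT/VAT roles and the feature-level/prediction-level pairing come from the abstract; the single-head attention, tensor shapes, mean-pooling fusion, and the concrete loss forms (MSE and symmetric KL) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossAttention(nn.Module):
        """Single-head cross-attention: `query` tokens attend over `context` tokens."""
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.v = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, query, context):
            q, k, v = self.q(query), self.k(context), self.v(context)
            attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
            return attn @ v

    class TransZeroPPSketch(nn.Module):
        """Two cross-attention sub-nets standing in for AVT and VAT."""
        def __init__(self, dim=128):
            super().__init__()
            self.avt = CrossAttention(dim)  # attribute queries attend to visual regions
            self.vat = CrossAttention(dim)  # visual queries attend to attributes

        def forward(self, regions, attrs):
            # regions: (B, R, dim) region features; attrs: (B, A, dim) attribute embeddings
            attr_visual = self.avt(attrs, regions)   # attribute-based visual features
            visual_attr = self.vat(regions, attrs)   # visual-based attribute features
            return attr_visual, visual_attr

    def collaborative_losses(attr_visual, visual_attr, logits_avt, logits_vat):
        # Feature-level collaboration: align the pooled embeddings of the two sub-nets.
        feat_loss = F.mse_loss(attr_visual.mean(dim=1), visual_attr.mean(dim=1))
        # Prediction-level collaboration: symmetric KL between the class distributions,
        # so the two sub-nets "teach each other" as the abstract describes.
        log_p = F.log_softmax(logits_avt, dim=-1)
        log_q = F.log_softmax(logits_vat, dim=-1)
        pred_loss = 0.5 * (F.kl_div(log_p, log_q, reduction="batchmean", log_target=True)
                           + F.kl_div(log_q, log_p, reduction="batchmean", log_target=True))
        return feat_loss + pred_loss

    def classify(attr_visual, visual_attr, class_semantics):
        # Fuse the two semantic-augmented embeddings, then score each class by
        # compatibility with its semantic vector (dot product), a common ZSL scheme.
        fused = 0.5 * (attr_visual.mean(dim=1) + visual_attr.mean(dim=1))  # (B, dim)
        return fused @ class_semantics.t()                                  # (B, num_classes)

In the paper's setting, regions would come from a CNN backbone and the attribute embeddings and class semantic vectors from class-level attribute annotations; the fusion and loss weighting used here are placeholders for whatever the full method specifies.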
Pages: 12844-12861
Number of pages: 18