M3R: Masked Token Mixup and Cross-Modal Reconstruction for Zero-Shot Learning

Cited: 0
Authors
Zhao, Peng [1 ]
Wang, Qiangchang [1 ]
Yin, Yilong [1 ]
Affiliation
[1] Shandong Univ, Sch Software, Jinan, Shandong, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
National Natural Science Foundation of China;
Keywords
zero-shot learning; mixup; masked image modeling; Transformer;
DOI
10.1145/3581783.3612104
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In zero-shot learning (ZSL), learned representation spaces are often biased toward seen classes, limiting the ability to recognize previously unseen classes. In this paper, we propose Masked token Mixup and cross-Modal Reconstruction for zero-shot learning, termed M3R, which significantly alleviates this bias. M3R consists of three components: Random Token Mixup (RTM), Unseen Class Detection (UCD), and Hard Cross-modal Reconstruction (HCR). First, mappings learned without proper adaptation to unseen classes are biased toward seen classes. To address this, RTM generates diverse unseen-class agents, broadening the representation space to cover unknown classes. It is applied at a randomly selected layer of the Vision Transformer, producing smooth low- and high-level representation-space boundaries that cover rich attributes. Second, the unseen-class agents generated by RTM may be confused with seen-class samples. To overcome this, UCD is designed to yield higher entropy for unseen classes, thereby distinguishing seen classes from unseen ones. Third, to further mitigate the bias toward seen classes and to explore associations between semantics and visual images, HCR reconstructs masked pixels from a few discriminative tokens and attribute embeddings. This enables the model to develop a deep understanding of image content and build strong connections between semantic attributes and visual information. Both qualitative and quantitative results demonstrate the effectiveness and usefulness of the proposed M3R model.
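The abstract does not give implementation details for RTM, so the sketch below is only an illustrative assumption: token sequences from two samples, taken at a randomly chosen Transformer layer, are linearly interpolated with a Beta-sampled mixup coefficient to form an unseen-class agent. The function names (`random_token_mixup`, `pick_mix_layer`) and the Beta(α, α) sampling are hypothetical, not from the paper.

```python
import numpy as np

def random_token_mixup(tokens_a, tokens_b, alpha=1.0, rng=None):
    """Hypothetical RTM sketch: mix two token sequences of shape
    (num_tokens, dim) taken at the same Transformer layer.

    Returns the mixed sequence (an "unseen-class agent") and the
    mixup coefficient lambda sampled from Beta(alpha, alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = float(rng.beta(alpha, alpha))          # mixup coefficient in [0, 1]
    mixed = lam * tokens_a + (1.0 - lam) * tokens_b
    return mixed, lam

def pick_mix_layer(num_layers, rng=None):
    """Choose the random Transformer layer index at which to apply RTM."""
    rng = np.random.default_rng() if rng is None else rng
    return int(rng.integers(0, num_layers))
```

Applying the mixup at a random depth (rather than always at the input) is what, per the abstract, smooths both low- and high-level representation boundaries.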
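UCD is described only as producing "greater entropy values for unseen classes". A minimal sketch of that idea, under the assumption that detection thresholds the entropy of the softmax prediction (the threshold value and function names are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prediction_entropy(logits):
    """Shannon entropy of the predicted class distribution per sample."""
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def detect_unseen(logits, threshold):
    """Flag samples whose prediction entropy exceeds the threshold
    as candidate unseen-class samples (hypothetical UCD rule)."""
    return prediction_entropy(logits) > threshold
```

A confident (peaked) prediction over seen classes has near-zero entropy, while a near-uniform prediction approaches log(K) for K classes, which is the gap such a detector exploits.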
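For HCR, the abstract says masked pixels are reconstructed from a few discriminative tokens and attribute embeddings. The objective itself is unspecified; the sketch below assumes, in the spirit of masked image modeling, a mean-squared reconstruction error computed only over masked positions (the function name and the MSE choice are assumptions):

```python
import numpy as np

def masked_reconstruction_loss(pred_pixels, target_pixels, mask):
    """Hypothetical HCR-style objective: MSE restricted to masked
    positions. `mask` is 1 where pixels were masked, 0 elsewhere."""
    diff = (pred_pixels - target_pixels) ** 2
    # Average the squared error over masked positions only.
    return float((diff * mask).sum() / np.maximum(mask.sum(), 1))
```

Restricting the loss to masked positions forces the model to infer hidden content from the surviving tokens and the attribute embeddings, which is how such an objective couples semantics to visual information.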
Pages: 3161-3171
Page count: 11
Related papers
50 records
  • [1] CROSS-MODAL REPRESENTATION RECONSTRUCTION FOR ZERO-SHOT CLASSIFICATION
    Wang, Yu
    Zhao, Shenjie
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 2820 - 2824
  • [2] Cross-modal Zero-shot Hashing
    Liu, Xuanwu
    Li, Zhao
    Wang, Jun
    Yu, Guoxian
    Domeniconi, Carlotta
    Zhang, Xiangliang
    2019 19TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2019), 2019, : 449 - 458
  • [3] Cross-modal Representation Learning for Zero-shot Action Recognition
    Lin, Chung-Ching
    Lin, Kevin
    Wang, Lijuan
    Liu, Zicheng
    Li, Linjie
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19946 - 19956
  • [4] Manifold regularized cross-modal embedding for zero-shot learning
    Ji, Zhong
    Yu, Yunlong
    Pang, Yanwei
    Guo, Jichang
    Zhang, Zhongfei
    INFORMATION SCIENCES, 2017, 378 : 48 - 58
  • [5] Cross-modal propagation network for generalized zero-shot learning
    Guo, Ting
    Liang, Jianqing
    Liang, Jiye
    Xie, Guo-Sen
    PATTERN RECOGNITION LETTERS, 2022, 159 : 125 - 131
  • [6] Generalized Zero-Shot Cross-Modal Retrieval
    Dutta, Titir
    Biswas, Soma
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (12) : 5953 - 5962
  • [7] DUET: Cross-Modal Semantic Grounding for Contrastive Zero-Shot Learning
    Chen, Zhuo
    Huang, Yufeng
    Chen, Jiaoyan
    Geng, Yuxia
    Zhang, Wen
    Fang, Yin
    Pan, Jeff Z.
    Chen, Huajun
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 405 - 413
  • [8] Learning Aligned Cross-Modal Representation for Generalized Zero-Shot Classification
    Fang, Zhiyu
    Zhu, Xiaobin
    Yang, Chun
    Han, Zheng
    Qin, Jingyan
    Yin, Xu-Cheng
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 6605 - 6613
  • [9] Cross-modal prototype learning for zero-shot handwritten character recognition
    Ao, Xiang
    Zhang, Xu-Yao
    Liu, Cheng-Lin
    PATTERN RECOGNITION, 2022, 131
  • [10] A Cross-Modal Alignment for Zero-Shot Image Classification
    Wu, Lu
    Wu, Chenyu
    Guo, Han
    Zhao, Zhihao
    IEEE ACCESS, 2023, 11 : 9067 - 9073