Preserving text space integrity for robust compositional zero-shot learning via mixture of pretrained experts

Citations: 0
Authors
Hao, Zehua
Liu, Fang [1 ]
Jiao, Licheng
Du, Yaoyang
Li, Shuo
Wang, Hao
Li, Pengfang
Liu, Xu
Chen, Puhua
Affiliations
[1] Xidian Univ, Sch Artificial Intelligent, 2 Taibai South Rd, Xian 710071, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Compositional zero-shot learning; Mixture of pretrained expert; Deep learning; IMAGE; RECOGNITION; FUSION; VIDEO; MODEL;
DOI
10.1016/j.neucom.2024.128773
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the current landscape of Compositional Zero-Shot Learning (CZSL) methods that leverage CLIP, the predominant approach is based on prompt-learning paradigms. These methods incur significant computational complexity when dealing with a large number of categories. Additionally, when confronted with new classification tasks, the prompts must be learned again, which can be both time-consuming and resource-intensive. To address these challenges, we present a new methodology, named Mixture of Pretrained Experts (MoPE), for enhancing compositional zero-shot learning through a logit-level multi-expert fusion module. MoPE blends the benefits of large pre-trained models such as BERT, GPT-3, and Word2Vec to tackle compositional zero-shot learning effectively. First, we extract the text label space of each language model individually and then map the visual feature vectors into their respective text spaces, which maintains the integrity and structure of each original text space. During this process, the pre-trained expert parameters are kept frozen; only the mappings from visual features to the corresponding text spaces are learned, so they can be regarded as multiple learnable visual experts. In the model fusion phase, we propose a new fusion strategy featuring a gating mechanism that dynamically adjusts the contributions of the various models, enabling our approach to adapt more effectively to a range of tasks and datasets. The method's robustness is demonstrated by the fact that the language models are not tailored to specific downstream task datasets or losses, which preserves each larger model's topology and expands the potential for application. Preliminary experiments conducted on the UT-Zappos, AO-CLEVr, and C-GQA datasets indicate that MoPE performs competitively compared to existing techniques.
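The pipeline described above (frozen per-expert label embeddings, learnable visual-to-text projections, and a gated logit-level fusion) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: all dimensions, the random stand-in projections, and the linear gating network are assumptions introduced here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: one visual feature, three frozen "expert" text
# spaces (e.g., BERT / GPT-3 / Word2Vec embeddings of the class labels).
n_classes, d_visual = 5, 16
expert_dims = [8, 12, 6]  # each expert's text-embedding dimensionality

visual = rng.standard_normal(d_visual)

# Frozen per-expert label embeddings (one row per class label).
label_spaces = [rng.standard_normal((n_classes, d)) for d in expert_dims]

# Learnable visual->text projections, one per expert (random stand-ins
# here; in training these would be the "learnable visual experts").
projections = [rng.standard_normal((d_visual, d)) for d in expert_dims]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def expert_logits(v, W, labels):
    """Cosine similarity between the projected visual feature and one
    expert's frozen label embeddings, giving per-class logits."""
    z = v @ W
    z = z / np.linalg.norm(z)
    L = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    return L @ z

logits = [expert_logits(visual, W, S)
          for W, S in zip(projections, label_spaces)]

# Gating network (a random linear map on the visual feature for this
# sketch) yields one weight per expert; the fused prediction is the
# gate-weighted sum of the per-expert logits.
W_gate = rng.standard_normal((d_visual, len(logits)))
gate = softmax(visual @ W_gate)
fused = sum(g * l for g, l in zip(gate, logits))

print(fused.shape)  # one fused logit per class
```

Because each expert's label space is embedded and normalized independently before fusion, the structure of every pretrained text space is left intact; only the projections and the gate would be trained.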
Pages: 12