I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

Cited by: 18

Authors:

Naeem, Muhammad Ferjad [1]

Khan, Muhammad Gul Zain Ali [2,3]

Xian, Yongqin [5]

Afzal, Muhammad Zeshan [2,3]

Stricker, Didier [2,3]

Van Gool, Luc [1]

Tombari, Federico [4,5]

Affiliations:

[1] Swiss Fed Inst Technol, Zurich, Switzerland

[2] TUKL, Kaiserslautern, Germany

[3] DFKI, Kaiserslautern, Germany

[4] TUM, Munich, Germany

[5] Google, Hamburg, Germany

Source:

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2023

Keywords:

DOI:

10.1109/CVPR52729.2023.01456

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and are limited to a single source of information. Large Language Models (LLMs) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks. In this work, we provide a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is provided with a few text descriptions from different annotators as examples. The LLM is conditioned on these examples to generate multiple text descriptions for each class (referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views. We show that each text view of a class provides complementary information, allowing a model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer is better at consuming the multi-view text supervision from the LLM than baseline models. I2MVFormer establishes a new state of the art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings. Code is available at https://github.com/ferjad/I2DFormer
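
The view-generation step described in the abstract can be illustrated with a minimal sketch of few-shot prompt construction. This is not the authors' code: the function names, the prompt layout, the example descriptions, and the choice of building one prompt per annotator are all assumptions made for illustration. The sketch only assembles the prompts and does not call any LLM.

```python
from typing import Dict, List


def build_view_prompt(examples: Dict[str, str], target_class: str) -> str:
    """Assemble a few-shot prompt: example class descriptions, then the target class.

    Hypothetical helper; the prompt layout is an assumption, not the paper's format.
    """
    parts: List[str] = []
    for class_name, description in examples.items():
        parts.append(f"Class: {class_name}\nDescription: {description}\n")
    # Leave the final description open so a language model would complete it.
    parts.append(f"Class: {target_class}\nDescription:")
    return "\n".join(parts)


def build_multi_view_prompts(
    annotator_examples: List[Dict[str, str]], target_class: str
) -> List[str]:
    """Build one prompt per annotator (assumed here to correspond to one view each)."""
    return [build_view_prompt(ex, target_class) for ex in annotator_examples]


# Hypothetical example descriptions from two annotators (illustrative text only).
annotator_examples = [
    {"zebra": "A horse-like animal with black and white stripes."},
    {"zebra": "Lives in African grasslands and grazes in herds."},
]
prompts = build_multi_view_prompts(annotator_examples, "giraffe")
```

Each prompt would then be sent to a language model of choice, and the completions collected as the views of the target class.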


Pages: 15169-15179

Page count: 11

References: 67 in total

