CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

Cited by: 0
Authors
Fürst, Andreas [1,2]
Rumetshofer, Elisabeth [1,2]
Lehner, Johannes [1,2]
Tran, Viet [1,2]
Tang, Fei [4]
Ramsauer, Hubert [1,2]
Kreil, David [3]
Kopp, Michael [3]
Klambauer, Günter [1,2]
Bitto-Nemling, Angela [1,2]
Hochreiter, Sepp [1,2,3]
Affiliations
[1] Johannes Kepler Univ Linz, Inst Machine Learning, ELLIS Unit Linz, Linz, Austria
[2] Johannes Kepler Univ Linz, Inst Machine Learning, LIT AI Lab, Linz, Austria
[3] Inst Adv Res Artificial Intelligence IARAI, Vienna, Austria
[4] HERE Technol, Zurich, Switzerland
Source
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
Funding
EU Horizon 2020
Keywords
DEEP LEARNING BENCHMARK; NEURAL-NETWORKS; LAND-USE; UNIFORMITY; EUROSAT; DATASET
DOI
N/A
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
CLIP yielded impressive results on zero-shot transfer learning tasks and is considered a foundation model like BERT or GPT-3. CLIP vision models, which have a rich representation, are pre-trained using the InfoNCE objective and natural language supervision before being fine-tuned on particular tasks. Though CLIP excels at zero-shot transfer learning, it suffers from an explaining-away problem, that is, it focuses on one or a few features while neglecting other relevant features. This problem is caused by insufficiently extracting the covariance structure of the original multi-modal data. We suggest using modern Hopfield networks to tackle the problem of explaining away. Their retrieved embeddings have an enriched covariance structure derived from co-occurrences of features in the stored embeddings. However, modern Hopfield networks increase the saturation effect of the InfoNCE objective, which hampers learning. We propose to use the InfoLOOB objective to mitigate this saturation effect. We introduce the novel "Contrastive Leave One Out Boost" (CLOOB), which uses modern Hopfield networks for covariance enrichment together with the InfoLOOB objective. In experiments, we compare CLOOB to CLIP after pre-training on the Conceptual Captions and YFCC datasets with respect to their zero-shot transfer learning performance on other datasets. CLOOB consistently outperforms CLIP at zero-shot transfer learning across all considered architectures and datasets.
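To make the abstract's two ingredients concrete, below is a minimal PyTorch sketch of one modern-Hopfield retrieval step and the InfoLOOB objective. It assumes L2-normalized batch embeddings; the function names, the values of the inverse temperature (inv_tau) and the Hopfield scaling (beta), and the exact pairing of retrieved embeddings in cloob_loss are illustrative assumptions based on the abstract, not the authors' reference implementation.

import torch
import torch.nn.functional as F

def hopfield_retrieval(queries, stored, beta=8.0):
    # One update step of a modern Hopfield network: each query is
    # replaced by a softmax-weighted mixture of the stored patterns,
    # enriching it with features that co-occur in the batch
    # (the "covariance enrichment" the abstract refers to).
    attn = F.softmax(beta * queries @ stored.t(), dim=1)  # (N, M)
    return attn @ stored                                  # (N, d)

def info_loob(anchors, targets, inv_tau=30.0):
    # InfoLOOB ("leave one out bound"): like InfoNCE, except the
    # positive pair is excluded from the denominator, so the loss
    # does not saturate once positives dominate the negatives.
    a = F.normalize(anchors, dim=1)
    t = F.normalize(targets, dim=1)
    logits = inv_tau * a @ t.t()                          # (N, N)
    pos = logits.diagonal()                               # matching pairs
    mask = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
    neg = logits.masked_fill(mask, float('-inf'))         # drop positives
    return -(pos - torch.logsumexp(neg, dim=1)).mean()

def cloob_loss(x, y, beta=8.0, inv_tau=30.0):
    # Sketch of the combined objective: retrieve image embeddings x
    # and text embeddings y from the image memory (stored x) and from
    # the text memory (stored y), then score each pairing with InfoLOOB.
    u_x, u_y = hopfield_retrieval(x, x, beta), hopfield_retrieval(y, x, beta)
    v_x, v_y = hopfield_retrieval(x, y, beta), hopfield_retrieval(y, y, beta)
    return info_loob(u_x, u_y, inv_tau) + info_loob(v_y, v_x, inv_tau)

Because the Hopfield retrieval averages over stored patterns, retrieved positives and negatives become more alike, which is exactly the regime where InfoNCE saturates and the leave-one-out denominator above keeps the gradient informative.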
Pages: 19