SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision

Cited by: 0
Authors
Vani, Ankit [1]
Nguyen, Bac [2]
Lavoie, Samuel [1]
Krishna, Ranjay [3]
Courville, Aaron [1,4]
Affiliations
[1] Univ Montreal, Mila, Montreal, PQ, Canada
[2] Sony AI, Stuttgart, Germany
[3] Univ Washington, Allen Inst Artificial Intelligence, Seattle, WA 98195 USA
[4] CIFAR AI Chair, Montreal, PQ, Canada
Source
COMPUTER VISION - ECCV 2024, PT LXVI | 2025 / Vol. 15124
Keywords
Selective attention; Slot representations; Transformers
DOI
10.1007/978-3-031-72848-8_14
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose Sparo, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using Sparo with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same corresponding concepts. Using Sparo, we demonstrate improvements on downstream recognition, robustness, retrieval, and compositionality benchmarks with CLIP (up to +14% for ImageNet, +4% for SugarCrepe), and on nearest neighbors and linear probe for ImageNet with DINO (+3% each). We also showcase a powerful ability to intervene and select individual Sparo concepts to further improve downstream task performance (up from +4% to +9% for SugarCrepe) and use this ability to study the robustness of Sparo's representation structure. Finally, we provide insights through ablation experiments and visualization of learned concepts.
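The abstract describes Sparo as a read-out mechanism that partitions an encoding into separately-attended slots, each produced by a single attention head. The following is a minimal numpy sketch of that idea under generic assumptions: each slot has one learned query that performs single-head cross-attention over the backbone's token embeddings, and the resulting slot vectors are projected and concatenated. All names, shapes, and the weight parameterization here are illustrative, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparo_readout(tokens, queries, W_k, W_v, W_out):
    """Sketch of a Sparo-style read-out (illustrative, not the paper's code).

    tokens:  (n, d) token embeddings from a transformer backbone
    queries: (L, d) one learned query per slot/concept
    Each of the L slots attends over the tokens independently,
    i.e. one single-head cross-attention per slot.
    """
    d = tokens.shape[1]
    keys = tokens @ W_k                                       # (n, d)
    values = tokens @ W_v                                     # (n, d)
    attn = softmax(queries @ keys.T / np.sqrt(d), axis=-1)    # (L, n)
    slots = attn @ values                                     # (L, d)
    return slots @ W_out                                      # (L, m) per-slot encodings

# Toy dimensions; a real model would learn queries and weights end-to-end.
rng = np.random.default_rng(0)
n, d, L, m = 5, 8, 4, 6
tokens = rng.normal(size=(n, d))
queries = rng.normal(size=(L, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))
W_out = rng.normal(size=(d, m))
enc = sparo_readout(tokens, queries, W_k, W_v, W_out)
print(enc.shape)  # (4, 6): L slot encodings, concatenated downstream
```

Because each slot is a separate row of `enc`, the "intervene and select individual Sparo concepts" experiments mentioned in the abstract amount to keeping or zeroing a subset of these rows before computing downstream similarities.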
Pages: 233-251
Page count: 19