How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Cited by: 9
Authors
Li, Yiran [1]
Wang, Junpeng [2]
Dai, Xin [2]
Wang, Liang [2]
Yeh, Chin-Chia Michael [2]
Zheng, Yan [2]
Zhang, Wei [2]
Ma, Kwan-Liu [1]
Affiliations
[1] Univ Calif Davis, Davis, CA 95616 USA
[2] Visa Res, Palo Alto, CA 94301 USA
Keywords
Head; Transformers; Visual analytics; Task analysis; Measurement; Heating systems; Deep learning; explainable artificial intelligence; multi-head self-attention; vision transformer; visual analytics
DOI
10.1109/TVCG.2023.3261935
Chinese Library Classification
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
The vision transformer (ViT) extends the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attention is then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to the interpretation of ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which ones are more important? How strongly do individual patches attend to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. First, we identify which heads are more important in ViTs by introducing multiple pruning-based metrics. Second, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. By examining the attention strengths and patterns of the important heads, we explain why they are important. Through concrete case studies with experienced deep learning experts on multiple ViTs, we validate the effectiveness of our solution, which deepens the understanding of ViTs in terms of head importance, head attention strength, and head attention pattern.
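The patch-sequence and multi-head self-attention steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`patchify`, `multi_head_attention_weights`), the random projection matrices, and the sizes (32x32 image, 8x8 patches, 4 heads) are assumptions chosen for illustration, and the class token and position embeddings used by real ViTs are omitted.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_weights(tokens, w_q, w_k, num_heads):
    """Return per-head attention weights of shape (num_heads, n, n)."""
    n = tokens.shape[0]
    q = tokens @ w_q  # (n, d_model)
    k = tokens @ w_k  # (n, d_model)
    d_head = q.shape[1] // num_heads
    # Split the model dimension into heads: (num_heads, n, d_head)
    q = q.reshape(n, num_heads, d_head).transpose(1, 0, 2)
    k = k.reshape(n, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    return softmax(scores, axis=-1)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
tokens = patchify(image, patch=8)  # 16 patches, 192 values each
d_model, num_heads = 64, 4
w_e = rng.normal(scale=0.02, size=(tokens.shape[1], d_model))  # patch embedding
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, d_model))
attn = multi_head_attention_weights(tokens @ w_e, w_q, w_k, num_heads)
```

Each slice `attn[h]` is a patch-by-patch matrix whose row `i` shows how strongly patch `i` attends to every other patch in head `h`; matrices of this kind are the raw material for the attention strength and attention pattern analyses the abstract describes.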
Pages: 2888-2900
Page count: 13