How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Cited by: 14

Authors:

Li, Yiran [1]

Wang, Junpeng [2]

Dai, Xin [2]

Wang, Liang [2]

Yeh, Chin-Chia Michael [2]

Zheng, Yan [2]

Zhang, Wei [2]

Ma, Kwan-Liu [1]

Affiliations:

[1] Univ Calif Davis, Davis, CA 95616 USA

[2] Visa Res, Palo Alto, CA 94301 USA

Source:

IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS | 2023 / Vol. 29 / No. 06

Keywords:

Head; Transformers; Visual analytics; Task analysis; Measurement; Heating systems; Deep learning; explainable artificial intelligence; multi-head self-attention; vision transformer; visual analytics;

DOI:

10.1109/TVCG.2023.3261935

Chinese Library Classification (CLC) number:

TP31 [Computer Software]

Discipline classification codes:

081202; 0835

Abstract:

Vision transformer (ViT) expands the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attentions are then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to the interpretation of ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which one is more important? How strong are individual patches attending to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. Specifically, we first identify what heads are more important in ViTs by introducing multiple pruning-based metrics. Then, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. Examining the attention strengths and patterns of the important heads, we answer why they are important. Through concrete case studies with experienced deep learning experts on multiple ViTs, we validate the effectiveness of our solution that deepens the understanding of ViTs from head importance, head attention strength, and head attention pattern.
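As an illustration of the first steps the abstract describes (splitting an image into a patch sequence, computing attention between patches, and asking how much each patch attends to its spatial neighbors), here is a minimal pure-Python sketch. It is not the authors' implementation or their metric: the function names, the toy 8x8 image, the use of raw patches in place of learned query/key projections, and the 8-neighborhood definition of "spatial neighbors" are all assumptions made for illustration.

```python
import math
import random

def patchify(image, patch):
    """Split an H x W image (list of rows) into a row-major sequence of flattened patches."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    seq = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            seq.append([image[r + i][c + j] for i in range(patch) for j in range(patch)])
    return seq

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention: row i is patch i's distribution over all patches."""
    d = len(keys[0])
    return [
        softmax([sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in keys])
        for qi in queries
    ]

def neighbor_attention(attn, grid):
    """Attention mass each patch places on its (up to 8) adjacent patches in a grid x grid layout."""
    out = []
    for i, row in enumerate(attn):
        r, c = divmod(i, grid)
        mass = 0.0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr, dc) != (0, 0) and 0 <= rr < grid and 0 <= cc < grid:
                    mass += row[rr * grid + cc]
        out.append(mass)
    return out

random.seed(0)
image = [[random.random() for _ in range(8)] for _ in range(8)]  # toy grayscale image
patches = patchify(image, patch=2)  # 16 patches of 4 pixels each
# Assumption: raw patches stand in for the learned query/key projections of one head.
attn = attention_weights(patches, patches)
strength = neighbor_attention(attn, grid=4)
```

In a real ViT each head has its own learned projections, so `strength` would differ per head and per layer; comparing those values across heads is the kind of profiling the abstract refers to.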


Pages: 2888-2900

Page count: 13

References (31 in total):

[1] Abnar S, 2020, 58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), P4190

[2] Aflalo E., 2022, PROC IEEECVF C COMPU, p21 406

[3] Bahdanau D, 2016, arXiv:1409.0473, DOI 10.48550/ARXIV.1409.0473

[4] Bau A., 2019, ICLR

[5] Cao, Jize; Gan, Zhe; Cheng, Yu; Yu, Licheng; Chen, Yen-Chun; Liu, Jingjing. Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [J]. COMPUTER VISION - ECCV 2020, PT VI, 2020, 12351: 565-580

[6] Cheonbok Park, 2019, 2019 IEEE Visualization Conference (VIS), P146, DOI 10.1109/VISUAL.2019.8933677

[7] Choo, Jaegul; Liu, Shixia. Visual Analytics for Explainable Deep Learning [J]. IEEE COMPUTER GRAPHICS AND APPLICATIONS, 2018, 38 (04): 84-92

[8] DeRose, Joseph F.; Wang, Jiayao; Berger, Matthew. Attention Flows: Analyzing and Comparing Attention Mechanisms in Language Models [J]. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2021, 27 (02): 1160-1170

[9] Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171

[10] Dong, Zhihang; Wu, Tongshuang; Song, Sicheng; Zhang, Mingrui. Interactive Attention Model Explorer for Natural Language Processing Tasks with Unbalanced Data Sizes [J]. 2020 IEEE PACIFIC VISUALIZATION SYMPOSIUM (PACIFICVIS), 2020: 46-50
