Generating Transferable Adversarial Examples against Vision Transformers

Cited by: 16
Authors
Wang, Yuxuan [1 ]
Wang, Jiakai [1 ,2 ]
Yin, Zinxin [1 ]
Gong, Ruihao [1 ,3 ]
Wang, Jingyi [1 ]
Liu, Aishan [1 ]
Liu, Xianglong [1 ,2 ]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing, Peoples R China
[2] Zhongguancun Lab, Beijing, Peoples R China
[3] SenseTime, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022年
Funding
National Natural Science Foundation of China;
Keywords
vision transformer; adversarial attacks; transferability;
DOI
10.1145/3503161.3547989
Chinese Library Classification
TP39 [Computer Applications];
Subject Classification Codes
081203; 0835;
Abstract
Vision transformers (ViTs) are prevailing in many visual recognition tasks and have therefore drawn intensive interest in generating adversarial examples against them. Different from CNNs, ViTs have unique architectural components, e.g., self-attention and image embedding, which are commonly shared across various types of transformer-based models. However, existing adversarial methods suffer from weak attack transferability because they overlook these architectural features. To address this problem, we propose an Architecture-oriented Transferable Attacking (ATA) framework that generates transferable adversarial examples by activating uncertain attention and perturbing sensitive embeddings. Specifically, we first locate the patch-wise attentional regions that most affect model perception, thereby intensively activating the uncertainty of the attention mechanism and in turn confusing the model's decisions. Furthermore, we search for the pixel-wise attack positions that are most likely to derange the embedded tokens via sensitive embedding perturbation, which serves as a strong transferable attack pattern. By jointly confusing these unique yet widely shared architectural features of transformer-based models, we achieve strong attack transferability across diverse ViTs. Extensive experiments on the large-scale ImageNet dataset with various popular transformers demonstrate that our ATA outperforms other baselines by large margins (at least +15% attack success rate). Our code has been released.
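To make the two architectural levers described in the abstract concrete, the following Python (PyTorch) sketch shows one way an attention-uncertainty term and an embedding-perturbation term could be folded into a standard PGD-style objective. It is a minimal illustration under stated assumptions, not the authors' released ATA implementation; the function and parameter names (ata_style_attack, attn_weight, embed_weight) and the choice of entropy and embedding-drift losses are hypothetical stand-ins for the paper's patch-wise attention activation and pixel-wise embedding search.

# Hypothetical sketch (not the authors' released ATA code): a PGD-style attack on a ViT
# whose loss adds (i) an attention-entropy term that makes the attention maps more
# uncertain and (ii) an embedding-shift term that perturbs the token embeddings.
import torch
import torch.nn.functional as F

def ata_style_attack(model, attn_prob_modules, embed_module, x, y,
                     eps=8 / 255, alpha=2 / 255, steps=10,
                     attn_weight=1.0, embed_weight=1.0):
    # attn_prob_modules: modules whose forward output is the softmaxed attention map
    # of shape (B, heads, N, N), e.g. block.attn.attn_drop in a timm ViT.
    # embed_module: module whose forward output is the token embeddings (B, N, D),
    # e.g. the patch-embedding layer.
    captured = {"attn": [], "embed": None}
    hooks = [m.register_forward_hook(lambda _m, _i, o: captured["attn"].append(o))
             for m in attn_prob_modules]
    hooks.append(embed_module.register_forward_hook(
        lambda _m, _i, o: captured.__setitem__("embed", o)))

    model.eval()
    for p in model.parameters():                # attack the input only
        p.requires_grad_(False)
    with torch.no_grad():                       # clean embeddings as a reference
        model(x)
        clean_embed = captured["embed"].detach()

    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        captured["attn"].clear()
        logits = model((x + delta).clamp(0, 1))
        ce = F.cross_entropy(logits, y)

        # Mean row-wise entropy of the captured attention maps (higher = more uncertain).
        ent = sum((-(a.clamp_min(1e-12) * a.clamp_min(1e-12).log()).sum(-1)).mean()
                  for a in captured["attn"]) / max(len(captured["attn"]), 1)

        # How far the current token embeddings drift from their clean values.
        embed_shift = (captured["embed"] - clean_embed).norm(dim=-1).mean()

        loss = ce + attn_weight * ent + embed_weight * embed_shift
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascend the combined objective
            delta.clamp_(-eps, eps)
        delta.grad = None

    for h in hooks:
        h.remove()
    return (x + delta).clamp(0, 1).detach()

In this sketch the adversarial example is then obtained by passing a surrogate ViT, its attention-probability submodules, and its patch-embedding module to ata_style_attack and transferring the returned image to other transformer-based victims; the relative weighting of the three loss terms is an assumed hyperparameter.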
Pages: 5181-5190
Number of pages: 10