Generating Transferable Adversarial Examples against Vision Transformers

Cited by: 16
Authors
Wang, Yuxuan [1 ]
Wang, Jiakai [1 ,2 ]
Yin, Zixin [1 ]
Gong, Ruihao [1 ,3 ]
Wang, Jingyi [1 ]
Liu, Aishan [1 ]
Liu, Xianglong [1 ,2 ]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing, Peoples R China
[2] Zhongguancun Lab, Beijing, Peoples R China
[3] SenseTime, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Natural Science Foundation of China;
Keywords
vision transformer; adversarial attacks; transferability;
DOI
10.1145/3503161.3547989
CLC Number
TP39 [Computer Applications];
Subject Classification Codes
081203 ; 0835 ;
Abstract
Vision transformers (ViTs) now prevail across a range of visual recognition tasks, and there is therefore intensive interest in generating adversarial examples against them. Unlike CNNs, ViTs have distinctive architectural components, e.g., self-attention and image embedding, which are shared across various types of transformer-based models. However, existing adversarial methods transfer poorly because they overlook these architectural features. To address this problem, we propose an Architecture-oriented Transferable Attacking (ATA) framework that generates transferable adversarial examples by activating uncertain attention and perturbing sensitive embeddings. Specifically, we first locate the patch-wise attentional regions that most affect model perception, thereby intensively activating the uncertainty of the attention mechanism and in turn confusing the model's decisions. Furthermore, we search for the pixel-wise attacking positions that are most likely to derange the embedded tokens via sensitive embedding perturbation, which serves as a strong transferable attacking pattern. By jointly confusing these distinctive yet widely shared architectural features of transformer-based models, we achieve strong attack transferability across diverse ViTs. Extensive experiments on the large-scale ImageNet dataset with various popular transformers demonstrate that our ATA outperforms other baselines by large margins (at least +15% attack success rate). Our code has been released.
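As a concrete starting point, the sketch below is a minimal, assumption-laden illustration rather than the authors' released ATA implementation: it uses plain PyTorch with the timm library, and the function name ata_style_attack, the weight lam, and the cosine-based embedding-deflection loss are illustrative choices. It shows a PGD-style attack on a ViT surrogate that combines the usual misclassification objective with a term that pushes the adversarial token embeddings away from their clean values, in the spirit of the sensitive-embedding perturbation described above; the attention-uncertainty activation and the patch/pixel position search are omitted because they need model-specific access to attention maps.

    # Minimal sketch (not the authors' released code): PGD-style attack on a ViT
    # surrogate with an added embedding-deflection term. Assumes PyTorch + timm;
    # `forward_features` is timm's standard feature-extraction API.
    import torch
    import torch.nn.functional as F
    import timm

    def ata_style_attack(model, x, y, eps=8/255, alpha=2/255, steps=10, lam=1.0):
        """Craft adversarial examples on a ViT surrogate; inputs assumed in [0, 1]."""
        model.eval()
        x_adv = x.clone().detach()
        with torch.no_grad():
            feat_clean = model.forward_features(x)   # clean token features (constant)
        for _ in range(steps):
            x_adv.requires_grad_(True)
            logits = model(x_adv)
            feat_adv = model.forward_features(x_adv)
            # (1) standard misclassification objective on the surrogate
            ce = F.cross_entropy(logits, y)
            # (2) embedding deflection: push adversarial token features away from clean ones
            deflect = 1.0 - F.cosine_similarity(
                feat_adv.flatten(1), feat_clean.flatten(1), dim=1).mean()
            loss = ce + lam * deflect
            grad = torch.autograd.grad(loss, x_adv)[0]
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()               # gradient ascent step
                x_adv = x + (x_adv - x).clamp(-eps, eps)          # project to L_inf ball
                x_adv = x_adv.clamp(0, 1)                         # keep a valid image range
        return x_adv.detach()

    if __name__ == "__main__":
        surrogate = timm.create_model("vit_base_patch16_224", pretrained=True)
        images = torch.rand(2, 3, 224, 224)   # placeholder inputs; real use needs the model's preprocessing
        labels = torch.tensor([1, 2])
        adv = ata_style_attack(surrogate, images, labels)

The design intuition, following the abstract, is that a perturbation which also degrades the shared token-embedding computation, rather than only the surrogate's logits, is more likely to transfer to other transformer-based models.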
Pages: 5181-5190
Page count: 10