Generating Transferable Adversarial Examples against Vision Transformers

Cited by: 16
Authors
Wang, Yuxuan [1 ]
Wang, Jiakai [1 ,2 ]
Yin, Zinxin [1 ]
Gong, Ruihao [1 ,3 ]
Wang, Jingyi [1 ]
Liu, Aishan [1 ]
Liu, Xianglong [1 ,2 ]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing, Peoples R China
[2] Zhongguancun Lab, Beijing, Peoples R China
[3] SenseTime, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022年
Funding
National Natural Science Foundation of China;
Keywords
vision transformer; adversarial attacks; transferability;
DOI
10.1145/3503161.3547989
Chinese Library Classification
TP39 [Computer Applications];
Subject Classification Codes
081203; 0835;
Abstract
Vision transformers (ViTs) are prevailing in many visual recognition tasks and have therefore drawn intensive interest in generating adversarial examples against them. Different from CNNs, ViTs have unique architectural components, e.g., self-attention and image embedding, which are commonly shared across various types of transformer-based models. However, existing adversarial methods suffer from weak attack transferability because they overlook these architectural features. To address this problem, we propose an Architecture-oriented Transferable Attacking (ATA) framework that generates transferable adversarial examples by activating uncertain attention and perturbing sensitive embeddings. Specifically, we first locate the patch-wise attentional regions that most affect model perception, thereby intensively activating the uncertainty of the attention mechanism and in turn confusing the model's decisions. Furthermore, we search for the pixel-wise attack positions that are most likely to derange the embedded tokens via sensitive embedding perturbation, which serves as a strong transferable attack pattern. By jointly confusing these unique yet widely shared architectural features of transformer-based models, we achieve strong attack transferability across diverse ViTs. Extensive experiments on the large-scale ImageNet dataset with various popular transformers demonstrate that our ATA outperforms other baselines by large margins (at least +15% attack success rate). Our code has been released.
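To make the two architectural levers described in the abstract concrete, the following Python (PyTorch) sketch shows one way an attention-uncertainty term and an embedding-perturbation term could be folded into a standard PGD-style objective. It is a minimal illustration under stated assumptions, not the authors' released ATA implementation; the function and parameter names (ata_style_attack, attn_weight, embed_weight) and the choice of entropy and embedding-drift losses are hypothetical stand-ins for the paper's patch-wise attention activation and pixel-wise embedding search.

# Hypothetical sketch (not the authors' released ATA code): a PGD-style attack on a ViT
# whose loss adds (i) an attention-entropy term that makes the attention maps more
# uncertain and (ii) an embedding-shift term that perturbs the token embeddings.
import torch
import torch.nn.functional as F

def ata_style_attack(model, attn_prob_modules, embed_module, x, y,
                     eps=8 / 255, alpha=2 / 255, steps=10,
                     attn_weight=1.0, embed_weight=1.0):
    # attn_prob_modules: modules whose forward output is the softmaxed attention map
    # of shape (B, heads, N, N), e.g. block.attn.attn_drop in a timm ViT.
    # embed_module: module whose forward output is the token embeddings (B, N, D),
    # e.g. the patch-embedding layer.
    captured = {"attn": [], "embed": None}
    hooks = [m.register_forward_hook(lambda _m, _i, o: captured["attn"].append(o))
             for m in attn_prob_modules]
    hooks.append(embed_module.register_forward_hook(
        lambda _m, _i, o: captured.__setitem__("embed", o)))

    model.eval()
    for p in model.parameters():                # attack the input only
        p.requires_grad_(False)
    with torch.no_grad():                       # clean embeddings as a reference
        model(x)
        clean_embed = captured["embed"].detach()

    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        captured["attn"].clear()
        logits = model((x + delta).clamp(0, 1))
        ce = F.cross_entropy(logits, y)

        # Mean row-wise entropy of the captured attention maps (higher = more uncertain).
        ent = sum((-(a.clamp_min(1e-12) * a.clamp_min(1e-12).log()).sum(-1)).mean()
                  for a in captured["attn"]) / max(len(captured["attn"]), 1)

        # How far the current token embeddings drift from their clean values.
        embed_shift = (captured["embed"] - clean_embed).norm(dim=-1).mean()

        loss = ce + attn_weight * ent + embed_weight * embed_shift
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascend the combined objective
            delta.clamp_(-eps, eps)
        delta.grad = None

    for h in hooks:
        h.remove()
    return (x + delta).clamp(0, 1).detach()

In this sketch the adversarial example is then obtained by passing a surrogate ViT, its attention-probability submodules, and its patch-embedding module to ata_style_attack and transferring the returned image to other transformer-based victims; the relative weighting of the three loss terms is an assumed hyperparameter.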
Pages: 5181-5190
Number of pages: 10