Generating Transferable Adversarial Examples against Vision Transformers

Cited by: 16
Authors
Wang, Yuxuan [1 ]
Wang, Jiakai [1 ,2 ]
Yin, Zixin [1 ]
Gong, Ruihao [1 ,3 ]
Wang, Jingyi [1 ]
Liu, Aishan [1 ]
Liu, Xianglong [1 ,2 ]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing, Peoples R China
[2] Zhongguancun Lab, Beijing, Peoples R China
[3] SenseTime, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Natural Science Foundation of China;
Keywords
vision transformer; adversarial attacks; transferability;
DOI
10.1145/3503161.3547989
CLC Number
TP39 [Computer Applications];
Subject Classification Codes
081203 ; 0835 ;
Abstract
Vision transformers (ViTs) now prevail across a range of visual recognition tasks, and there is therefore intensive interest in generating adversarial examples against them. Unlike CNNs, ViTs have distinctive architectural components, e.g., self-attention and image embedding, which are shared across various types of transformer-based models. However, existing adversarial methods transfer poorly because they overlook these architectural features. To address this problem, we propose an Architecture-oriented Transferable Attacking (ATA) framework that generates transferable adversarial examples by activating uncertain attention and perturbing sensitive embeddings. Specifically, we first locate the patch-wise attentional regions that most affect model perception, thereby intensively activating the uncertainty of the attention mechanism and in turn confusing the model's decisions. Furthermore, we search for the pixel-wise attacking positions that are most likely to derange the embedded tokens via sensitive embedding perturbation, which serves as a strong transferable attacking pattern. By jointly confusing these distinctive yet widely shared architectural features of transformer-based models, we achieve strong attack transferability across diverse ViTs. Extensive experiments on the large-scale ImageNet dataset with various popular transformers demonstrate that our ATA outperforms other baselines by large margins (at least +15% attack success rate). Our code has been released.
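As a concrete starting point, the sketch below is a minimal, assumption-laden illustration rather than the authors' released ATA implementation: it uses plain PyTorch with the timm library, and the function name ata_style_attack, the weight lam, and the cosine-based embedding-deflection loss are illustrative choices. It shows a PGD-style attack on a ViT surrogate that combines the usual misclassification objective with a term that pushes the adversarial token embeddings away from their clean values, in the spirit of the sensitive-embedding perturbation described above; the attention-uncertainty activation and the patch/pixel position search are omitted because they need model-specific access to attention maps.

    # Minimal sketch (not the authors' released code): PGD-style attack on a ViT
    # surrogate with an added embedding-deflection term. Assumes PyTorch + timm;
    # `forward_features` is timm's standard feature-extraction API.
    import torch
    import torch.nn.functional as F
    import timm

    def ata_style_attack(model, x, y, eps=8/255, alpha=2/255, steps=10, lam=1.0):
        """Craft adversarial examples on a ViT surrogate; inputs assumed in [0, 1]."""
        model.eval()
        x_adv = x.clone().detach()
        with torch.no_grad():
            feat_clean = model.forward_features(x)   # clean token features (constant)
        for _ in range(steps):
            x_adv.requires_grad_(True)
            logits = model(x_adv)
            feat_adv = model.forward_features(x_adv)
            # (1) standard misclassification objective on the surrogate
            ce = F.cross_entropy(logits, y)
            # (2) embedding deflection: push adversarial token features away from clean ones
            deflect = 1.0 - F.cosine_similarity(
                feat_adv.flatten(1), feat_clean.flatten(1), dim=1).mean()
            loss = ce + lam * deflect
            grad = torch.autograd.grad(loss, x_adv)[0]
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()               # gradient ascent step
                x_adv = x + (x_adv - x).clamp(-eps, eps)          # project to L_inf ball
                x_adv = x_adv.clamp(0, 1)                         # keep a valid image range
        return x_adv.detach()

    if __name__ == "__main__":
        surrogate = timm.create_model("vit_base_patch16_224", pretrained=True)
        images = torch.rand(2, 3, 224, 224)   # placeholder inputs; real use needs the model's preprocessing
        labels = torch.tensor([1, 2])
        adv = ata_style_attack(surrogate, images, labels)

The design intuition, following the abstract, is that a perturbation which also degrades the shared token-embedding computation, rather than only the surrogate's logits, is more likely to transfer to other transformer-based models.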
Pages: 5181-5190
Page count: 10