Scale-space Tokenization for Improving the Robustness of Vision Transformers

Cited by: 0
Authors
Xu, Lei [1 ]
Kawakami, Rei [1 ]
Inoue, Nakamasa [1 ]
Affiliations
[1] Tokyo Inst Technol, Tokyo, Japan
Source
Proceedings of the 31st ACM International Conference on Multimedia (MM 2023) | 2023
Keywords
Robustness; Vision Transformer; Image Classification; Scale-space Theory; Positional Encoding
DOI
10.1145/3581783.3612060
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The Vision Transformer (ViT) and its variants have surpassed traditional Convolutional Neural Networks (CNNs) in in-distribution accuracy on most vision tasks. However, ViTs still leave significant room for improvement in their robustness to input perturbations, and robustness is a critical consideration when deploying ViTs in real-world scenarios. Moreover, some ViT variants improve in-distribution accuracy and computational performance at the cost of robustness and generalization. In this study, inspired by prior findings on the effectiveness of shape bias for robustness and on the importance of multi-scale analysis, we propose a simple yet effective method, scale-space tokenization, which improves the robustness of ViT while maintaining in-distribution accuracy. Based on this method, we build the Scale-space-based Robust Vision Transformer (SRVT). Our method consists of scale-space patch embedding and scale-space positional encoding. The scale-space patch embedding tokenizes a sequence of variable-scale images, increasing the model's shape bias and thereby its robustness. The scale-space positional encoding implicitly strengthens the model's invariance to input perturbations by incorporating scale-aware position information into a 3D sinusoidal positional encoding. We conduct experiments on image recognition benchmarks (CIFAR-10/100 and ImageNet-1k), evaluating in-distribution accuracy as well as adversarial and out-of-distribution robustness. The results demonstrate that our method improves robustness without compromising in-distribution accuracy. In particular, our approach achieves superior adversarial robustness on the ImageNet-1k benchmark compared with state-of-the-art robust ViTs.
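The abstract gives enough detail to sketch the tokenization pipeline: blur the input at several Gaussian scales, patch-embed each scale, and add a 3D sinusoidal positional encoding over (scale, row, column). The PyTorch sketch below is a plausible reading of the abstract, not the paper's implementation: the class name ScaleSpaceTokenizer, the sigma values, and the even three-way split of the embedding dimension across the three axes are all illustrative assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_blur(x, sigma):
    """Separable Gaussian blur of a batch of images (B, C, H, W)."""
    radius = max(1, int(3 * sigma))
    coords = torch.arange(-radius, radius + 1, dtype=x.dtype, device=x.device)
    k = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    k = k / k.sum()
    c = x.shape[1]
    kx = k.view(1, 1, 1, -1).expand(c, 1, 1, -1)  # horizontal pass
    ky = k.view(1, 1, -1, 1).expand(c, 1, -1, 1)  # vertical pass
    x = F.conv2d(x, kx, padding=(0, radius), groups=c)
    return F.conv2d(x, ky, padding=(radius, 0), groups=c)


def sincos_1d(positions, dim):
    """Standard 1-D sinusoidal encoding; returns (len(positions), dim)."""
    omega = torch.exp(
        -math.log(10000.0) * torch.arange(0, dim, 2, dtype=torch.float32) / dim
    )
    args = positions.float()[:, None] * omega[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)


class ScaleSpaceTokenizer(nn.Module):
    """Hypothetical scale-space tokenizer: patch-embeds a Gaussian scale
    space of the input and adds a 3-D (scale, row, column) sinusoidal
    positional encoding. Sigmas and the dimension split are assumptions."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=768, sigmas=(0.0, 1.0, 2.0)):
        super().__init__()
        assert embed_dim % 3 == 0  # assumed: equal split over the 3 axes
        self.sigmas = sigmas
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        n, d, s = img_size // patch_size, embed_dim // 3, len(sigmas)
        pe_s = sincos_1d(torch.arange(s), d)   # scale axis
        pe_y = sincos_1d(torch.arange(n), d)   # row axis
        pe_x = sincos_1d(torch.arange(n), d)   # column axis
        pe = torch.cat([
            pe_s[:, None, None, :].expand(-1, n, n, -1),
            pe_y[None, :, None, :].expand(s, -1, n, -1),
            pe_x[None, None, :, :].expand(s, n, -1, -1),
        ], dim=-1).reshape(s * n * n, embed_dim)
        self.register_buffer("pos_embed", pe)

    def forward(self, x):
        tokens = []
        for sigma in self.sigmas:
            xi = gaussian_blur(x, sigma) if sigma > 0 else x
            tokens.append(self.proj(xi).flatten(2).transpose(1, 2))
        return torch.cat(tokens, dim=1) + self.pos_embed[None]


if __name__ == "__main__":
    tok = ScaleSpaceTokenizer()
    out = tok(torch.randn(2, 3, 224, 224))
    print(out.shape)  # torch.Size([2, 588, 768]): 3 scales of 14x14 patches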
Pages: 2684 - 2693
Page count: 10