Scale-space Tokenization for Improving the Robustness of Vision Transformers

Cited by: 0
Authors
Xu, Lei [1]
Kawakami, Rei [1]
Inoue, Nakamasa [1]
Affiliations
[1] Tokyo Inst Technol, Tokyo, Japan
Keywords
Robustness; Vision Transformer; Image Classification; Scale-space Theory; Positional Encoding
DOI
10.1145/3581783.3612060
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Vision Transformer (ViT) models and their variants have surpassed traditional Convolutional Neural Networks (CNNs) in in-distribution accuracy on most vision tasks. However, ViTs still have significant room for improvement in robustness to input perturbations, which is critical when deploying them in real-world scenarios. Moreover, some ViT variants improve in-distribution accuracy and computational efficiency at the cost of robustness and generalization. In this study, inspired by prior findings on the potential of shape bias for improving robustness and on the importance of multi-scale analysis, we propose a simple yet effective method, scale-space tokenization, that improves the robustness of ViT while maintaining in-distribution accuracy. Based on this method, we build the Scale-space-based Robust Vision Transformer (SRVT) model. Our method consists of scale-space patch embedding and scale-space positional encoding. The scale-space patch embedding constructs a sequence of variable-scale images and increases the model's shape bias to enhance its robustness. The scale-space positional encoding implicitly boosts the model's invariance to input perturbations by incorporating scale-aware position information into a 3D sinusoidal positional encoding. We conduct experiments on image recognition benchmarks (CIFAR10/100 and ImageNet-1k), evaluating in-distribution accuracy, adversarial robustness, and out-of-distribution robustness. The experimental results demonstrate that our method improves robustness without compromising in-distribution accuracy. In particular, our approach achieves improved adversarial robustness on the ImageNet-1k benchmark compared with state-of-the-art robust ViTs.
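The abstract names two components, scale-space patch embedding and a 3D sinusoidal positional encoding with scale-aware position information, but gives no implementation details. The following is only a minimal NumPy sketch of one plausible reading, not the authors' SRVT code. Everything specific is an assumption made for illustration: the scale-space is built by Gaussian blurring at fixed resolution, the blur levels `sigmas`, the patch size, the embedding width `dim`, the random matrix `proj` standing in for a learned projection, and the use of (row, column, scale index) as the three encoded coordinates. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of an (H, W, C) image with reflect padding."""
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="reflect")
    # Filter along the vertical axis, then along the horizontal axis.
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, tmp)

def patchify(img, patch):
    """Split an (H, W, C) image into flattened patches in row-major grid order."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

def sincos_1d(pos, dim):
    """Sinusoidal encoding of 1D positions into `dim` (even) channels."""
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    ang = pos[:, None] * freqs[None, :]
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

def sincos_3d(rows, cols, scales, dim):
    """Concatenate sinusoidal encodings of (row, col, scale); dim % 6 == 0."""
    assert dim % 6 == 0, "dim must split into three even-sized parts"
    d = dim // 3
    return np.concatenate(
        [sincos_1d(rows, d), sincos_1d(cols, d), sincos_1d(scales, d)], axis=1
    )

def scale_space_tokens(img, sigmas=(0.0, 1.0, 2.0), patch=8, dim=48, seed=0):
    """Tokenize every blur level of `img` and add a 3D positional encoding."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    rng = np.random.default_rng(seed)
    # Assumption: a random matrix stands in for the learned patch projection,
    # shared across all scale levels.
    proj = rng.standard_normal((patch * patch * C, dim)) / np.sqrt(patch * patch * C)
    rows = np.repeat(np.arange(gh), gw).astype(np.float64)
    cols = np.tile(np.arange(gw), gh).astype(np.float64)
    tokens = []
    for level, sigma in enumerate(sigmas):
        blurred = img if sigma == 0 else gaussian_blur(img, sigma)
        emb = patchify(blurred, patch) @ proj
        scale = np.full(gh * gw, float(level))
        tokens.append(emb + sincos_3d(rows, cols, scale, dim))
    # Token sequence of length len(sigmas) * gh * gw.
    return np.concatenate(tokens, axis=0)
```

Under these assumptions, a 32x32 RGB image with three blur levels and 8x8 patches yields 3 x 16 = 48 tokens. Coarser levels suppress fine texture, which is one way the sequence could shift the model toward shape cues, as the abstract suggests.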
Pages: 2684-2693
Number of pages: 10