The Vision Transformer (ViT) and its variants have surpassed traditional Convolutional Neural Networks (CNNs) in in-distribution accuracy on most vision tasks. However, ViTs still have significant room for improvement in their robustness to input perturbations, and robustness is a critical consideration when deploying ViTs in real-world scenarios. Despite this, some ViT variants improve in-distribution accuracy and computational efficiency at the cost of the model's robustness and generalization. In this study, inspired by prior findings on the effectiveness of shape bias for improving robustness and on the importance of multi-scale analysis, we propose a simple yet effective method, scale-space tokenization, which improves the robustness of ViT while maintaining in-distribution accuracy. Based on this method, we build the Scale-space-based Robust Vision Transformer (SRVT). Our method consists of scale-space patch embedding and scale-space positional encoding. The scale-space patch embedding converts the input into a sequence of variable-scale images and increases the model's shape bias, thereby enhancing its robustness. The scale-space positional encoding implicitly boosts the model's invariance to input perturbations by incorporating scale-aware position information into a 3D sinusoidal positional encoding. We conduct experiments on image recognition benchmarks (CIFAR10/100 and ImageNet-1k) from the perspectives of in-distribution accuracy, adversarial robustness, and out-of-distribution robustness. The experimental results demonstrate our method's effectiveness in improving robustness without compromising in-distribution accuracy. In particular, our approach achieves superior adversarial robustness on the ImageNet-1k benchmark compared with state-of-the-art robust ViTs.
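To make the two components concrete, the sketch below illustrates one plausible reading of scale-space tokenization in PyTorch: the input is expanded into a Gaussian scale space, each scale is patch-embedded, and a 3D (scale, row, column) sinusoidal positional encoding is added. All names and hyperparameters (ScaleSpaceTokenizer, num_scales, base_sigma, the blur schedule) are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of scale-space tokenization, assuming a Gaussian scale space
# and a 3D sinusoidal positional encoding over (scale, row, column).
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF


def sincos_1d(positions: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard 1D sinusoidal encoding for a vector of integer positions."""
    omega = torch.arange(dim // 2, dtype=torch.float32) / (dim // 2)
    omega = 1.0 / (10000 ** omega)                        # (dim/2,)
    out = positions.float().unsqueeze(-1) * omega         # (N, dim/2)
    return torch.cat([out.sin(), out.cos()], dim=-1)      # (N, dim)


class ScaleSpaceTokenizer(nn.Module):
    """Patch-embeds a Gaussian scale space of the input image and adds a
    3D (scale, row, column) sinusoidal positional encoding (hypothetical)."""

    def __init__(self, img_size=224, patch_size=16, embed_dim=768,
                 num_scales=3, base_sigma=1.0):
        super().__init__()
        assert embed_dim % 6 == 0, "embed_dim must split evenly over 3 axes"
        self.num_scales = num_scales
        self.base_sigma = base_sigma
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                              stride=patch_size)

        # Precompute the 3D sinusoidal positional encoding for all tokens.
        grid = img_size // patch_size
        d = embed_dim // 3
        s_idx, y_idx, x_idx = torch.meshgrid(
            torch.arange(num_scales), torch.arange(grid), torch.arange(grid),
            indexing="ij")
        pe = torch.cat([sincos_1d(s_idx.reshape(-1), d),
                        sincos_1d(y_idx.reshape(-1), d),
                        sincos_1d(x_idx.reshape(-1), d)], dim=-1)
        self.register_buffer("pos_embed", pe.unsqueeze(0))  # (1, S*G*G, D)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) -> tokens: (B, num_scales * (H/P) * (W/P), D)
        tokens = []
        for s in range(self.num_scales):
            if s == 0:
                blurred = x                                  # original scale
            else:
                sigma = self.base_sigma * (2 ** (s - 1))     # assumed schedule
                k = int(2 * round(3 * sigma) + 1)            # odd kernel size
                blurred = TF.gaussian_blur(x, [k, k], [sigma, sigma])
            t = self.proj(blurred).flatten(2).transpose(1, 2)  # (B, G*G, D)
            tokens.append(t)
        return torch.cat(tokens, dim=1) + self.pos_embed


if __name__ == "__main__":
    tok = ScaleSpaceTokenizer()
    out = tok(torch.randn(2, 3, 224, 224))
    print(out.shape)  # torch.Size([2, 588, 768]): 3 scales of 14x14 patches
```

The resulting token sequence, carrying both coarse (heavily blurred, shape-dominated) and fine scales with scale-aware positions, would then feed a standard ViT encoder; the exact blur schedule, scale count, and encoding split used in SRVT may differ from the assumptions above.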