The increasing prevalence of encrypted communication on the modern internet has presented new challenges for traffic classification and network management. Traditional traffic classification methods cannot handle encrypted traffic effectively, and many existing methods either rely on hand-crafted features or fail to adequately capture the underlying interaction patterns between data packets. In this paper, we propose a novel encrypted traffic classification method called the Attention-based Vision Transformer and Spatiotemporal for Traffic Classification (ATVITSC). In the preprocessing stage, packet-level images generated from the payloads of the data packets within a session are combined into a single session image to mitigate information confusion. In the classification stage, each session image is first processed by the packet vision transformer (PVT) module, which employs a transformer encoder with multi-head self-attention to capture global features. In parallel, the session image is processed by the spatiotemporal feature extraction (STFE) module, in which spatial features of individual packets are extracted by convolution with an attention mechanism and temporal dependencies between packets are then modeled by a bidirectional Long Short-Term Memory (BiLSTM) network. The global and spatiotemporal features are fused in the feature fusion classification (FFC) module through a dynamic weighting mechanism, and the encrypted traffic is finally classified based on the fused features. Comprehensive experiments on various types of encrypted traffic, including virtual private network (VPN), onion router (Tor), malicious, and mobile traffic, show that ATVITSC improves the macro-F1 scores to 97.88%, 98.79%, 99.67%, and 94.90%, respectively. The results also show that ATVITSC achieves better classification performance and generalization ability than state-of-the-art methods.
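To make the two-branch design concrete, the following is a minimal PyTorch sketch of the overall idea: a transformer-encoder branch standing in for the PVT module, a convolution-plus-BiLSTM branch standing in for the STFE module, and a learnable weight standing in for the dynamic weighting of the FFC module. All hyperparameters (embedding size, packet count, class count), the simplified attention, and the scalar fusion weight are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualBranchTrafficClassifier(nn.Module):
    """Illustrative two-branch classifier over per-packet payload embeddings."""

    def __init__(self, num_packets=16, packet_dim=256, num_classes=12):
        super().__init__()
        # Global branch: transformer encoder with multi-head self-attention
        # over the packet sequence (stand-in for the PVT module).
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=packet_dim, nhead=8, batch_first=True)
        self.global_branch = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Spatiotemporal branch: per-packet 1-D convolution followed by a
        # bidirectional LSTM across packets (stand-in for the STFE module).
        self.conv = nn.Conv1d(1, 32, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(32 * packet_dim, packet_dim // 2,
                              bidirectional=True, batch_first=True)
        # Learnable scalar weighting the two feature streams
        # (a simple stand-in for the dynamic weighting in the FFC module).
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.classifier = nn.Linear(packet_dim, num_classes)

    def forward(self, x):
        # x: (batch, num_packets, packet_dim) per-packet payload embeddings
        b, n, d = x.shape
        g = self.global_branch(x).mean(dim=1)        # global features (b, d)
        c = self.conv(x.reshape(b * n, 1, d))        # spatial features per packet
        c = c.reshape(b, n, -1)                      # (b, n, 32 * d)
        s, _ = self.bilstm(c)                        # temporal features across packets
        s = s.mean(dim=1)                            # (b, d)
        w = torch.sigmoid(self.alpha)
        fused = w * g + (1 - w) * s                  # weighted feature fusion
        return self.classifier(fused)

# Example usage: a batch of 4 sessions, 16 packets each, 256-dim embeddings.
logits = DualBranchTrafficClassifier()(torch.randn(4, 16, 256))
```

The sketch omits the attention applied inside the convolutional path and replaces the dynamic weighting with a single learnable scalar; it only conveys how global and spatiotemporal features could be extracted in parallel and fused before classification.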