Video anomaly detection is a critical component of intelligent video surveillance systems, extensively deployed and researched in industry and academia. However, existing methods generalize so strongly that they predict anomalous samples almost as well as normal ones, and they cannot fully exploit the high-level semantic and temporal contextual information in videos, resulting in unstable prediction performance. To alleviate these issues, we propose an encoder-decoder model named SMAMS, based on a spatiotemporal masked autoencoder and memory modules. First, we represent video events as spatiotemporal cubes and mask a subset of them. Then, the unmasked patches are fed into the spatiotemporal masked autoencoder to extract high-level semantic and spatiotemporal features of the video events. Next, we add multiple memory modules to store unmasked video patches at different feature layers. Finally, skip connections are introduced to compensate for the crucial information loss caused by the memory modules. Experimental results show that the proposed method outperforms state-of-the-art methods, achieving AUC scores of 99.9%, 94.8%, and 78.9% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively.
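The pipeline described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the cube size, mask ratio, linear "encoder"/"decoder", and nearest-prototype memory lookup are all simplifying assumptions chosen only to show how cube masking, memory retrieval, and a skip connection fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(video, cube=(4, 8, 8)):
    # Split a (T, H, W) clip into non-overlapping spatiotemporal cubes,
    # each flattened to a vector (assumed cube size, not the paper's).
    t, h, w = cube
    T, H, W = video.shape
    cubes = video.reshape(T // t, t, H // h, h, W // w, w)
    return cubes.transpose(0, 2, 4, 1, 3, 5).reshape(-1, t * h * w)

def memory_read(z, memory):
    # Stand-in for a memory module: replace each latent vector by its
    # nearest stored prototype under cosine similarity.
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    mn = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    idx = (zn @ mn.T).argmax(axis=1)
    return memory[idx]

# Toy 16-frame 64x64 clip in place of a real video event.
video = rng.standard_normal((16, 64, 64))
patches = patchify(video)                  # (256, 256) cube vectors
keep = rng.random(len(patches)) > 0.75     # mask ~75% of the cubes
visible = patches[keep]                    # only unmasked patches are encoded

W_enc = rng.standard_normal((patches.shape[1], 32)) * 0.1
z = visible @ W_enc                        # "encoder": a linear projection
memory = z[:10].copy()                     # memory slots seeded from normal data
z_mem = memory_read(z, memory)             # retrieve nearest prototypes
z_out = z_mem + z                          # skip connection restores lost detail

W_dec = rng.standard_normal((32, patches.shape[1])) * 0.1
recon = z_out @ W_dec                      # "decoder" back to cube space
error = float(np.mean((recon - visible) ** 2))  # reconstruction error as score
```

Because the memory holds prototypes of normal patterns, anomalous inputs retrieve poorly matching slots and yield larger reconstruction errors, which is the signal thresholded at test time; the skip connection keeps normal inputs from also being distorted by the memory bottleneck.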