With the rapid advancement of communication technologies, wireless networks have not only transformed people's lifestyles but also spurred numerous emerging applications and services. Against this backdrop, Wi-Fi-based human activity recognition (HAR) has become an active research topic in both academia and industry. Channel State Information (CSI) contains rich spatiotemporal information. However, existing deep learning methods for HAR typically focus on either temporal or spatial features. While some approaches combine both, they often emphasize temporal sequences and underutilize spatial information. In contrast, this paper proposes RGANet, which uses a modified residual network (ResNet) instead of a simple CNN. This modification allows effective spatial feature extraction while preserving temporal information. The extracted spatial features are then fed into a modified gated recurrent unit (GRU) model for temporal sequence learning. RGANet achieves an accuracy of 99.4% on the UT_HAR dataset and 99.24% on the NTU-FI HAR dataset, improving on existing models by 1.21% and 0.38%, respectively.