Video representation learning is crucial for a wide range of tasks, and self-attention has emerged as an effective technique for capturing long-range dependencies. However, existing methods compute pairwise correlations simultaneously along the spatial and temporal dimensions, neglecting the distinct contextual information each dimension conveys. To address this limitation, we propose a novel module that models spatial and temporal correlations sequentially, enabling spatial context to be efficiently integrated into temporal modeling. By incorporating this module into a 2D CNN, we develop a self-attention module network tailored for video representation learning. We evaluate our approach on two benchmark datasets for moment retrieval and highlight detection: Charades-STA and QVHighlights. Extensive experiments show that the self-attention module network outperforms existing methods on both datasets. Notably, our models consistently surpass shallower networks and those using fewer modalities, underscoring the strength of our approach. In summary, the proposed self-attention module advances video representation learning by effectively capturing spatial and temporal correlations, and the improvements achieved in moment retrieval and highlight detection validate the efficacy and versatility of our approach.
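To make the sequential spatial-then-temporal design concrete, the following is a minimal sketch of a factorized self-attention block, not the paper's exact module: the class name, layer choices, and tensor layout are illustrative assumptions. It first applies self-attention within each frame (spatial) and then across frames at each spatial location (temporal), so the temporal stage operates on features already enriched with spatial context.

```python
# Minimal sketch (illustrative, not the paper's exact architecture): factorized
# self-attention over a video feature map of shape (batch, time, channels, H, W).
import torch
import torch.nn as nn


class SequentialSpatioTemporalAttention(nn.Module):
    """Spatial self-attention per frame, followed by temporal self-attention per location."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape

        # Stage 1: spatial self-attention within each frame.
        s = x.permute(0, 1, 3, 4, 2).reshape(b * t, h * w, c)   # (B*T, H*W, C)
        n = self.norm1(s)
        s = s + self.spatial_attn(n, n, n)[0]

        # Stage 2: temporal self-attention across frames at each spatial location,
        # so temporal modeling sees spatially contextualized features.
        s = s.reshape(b, t, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, t, c)
        n = self.norm2(s)
        s = s + self.temporal_attn(n, n, n)[0]

        # Restore (B, T, C, H, W).
        return s.reshape(b, h * w, t, c).permute(0, 2, 3, 1).reshape(b, t, c, h, w)


if __name__ == "__main__":
    # Toy usage: 2 clips, 8 frames, 64 channels, 7x7 feature map (hypothetical sizes).
    feats = torch.randn(2, 8, 64, 7, 7)
    module = SequentialSpatioTemporalAttention(channels=64)
    print(module(feats).shape)  # torch.Size([2, 8, 64, 7, 7])
```

Such a block can be inserted after a convolutional stage of a 2D CNN backbone; the residual connections keep the module easy to drop into an existing network.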