Real-time video services are gaining popularity in daily life, yet limited network bandwidth can constrain the delivered video quality. Video Super-Resolution (VSR) has emerged as a key technology for enhancing user experience by reconstructing high-resolution (HR) videos. Existing real-time VSR frameworks have primarily emphasized spatial quality metrics such as PSNR and SSIM, which neglect temporal coherence, a critical factor in the overall quality of super-resolved videos. Inspired by Video Quality Assessment (VQA) strategies, we propose a dual-frame training framework and a lightweight multi-branch network for real-time VSR. These designs fully exploit the spatio-temporal correlations between consecutive frames to enable efficient video restoration. Furthermore, we incorporate ST-RRED, a powerful VQA metric that separately measures spatial and temporal consistency in line with human perception, into our loss functions. This guides the network to synthesize quality-aware perceptual features across both space and time for realistic reconstruction. Our model is highly efficient, achieving near real-time processing of 4K videos. Compared with the state-of-the-art lightweight model MRVSR, ours is more compact and faster: 60% smaller (0.483M vs. 1.21M parameters) and 106% quicker (96.44 fps vs. 46.7 fps on 1080p frames), with significantly improved perceptual quality.
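
To make the dual-frame, quality-aware training idea concrete, the sketch below pairs a per-frame spatial penalty with a temporal-consistency penalty on consecutive frames. It is a simplified illustration only, not the paper's ST-RRED-based loss: the Charbonnier penalty, the weights w_spatial and w_temporal, and all function names are assumptions introduced here for exposition.

    # Minimal sketch (assumed, not the paper's implementation) of a dual-frame
    # loss combining a spatial term on each frame with a temporal term that
    # compares frame-to-frame changes of the SR pair against the HR pair.
    import torch

    def charbonnier(x, eps=1e-6):
        """Smooth L1-style penalty, averaged over all elements."""
        return torch.sqrt(x * x + eps * eps).mean()

    def dual_frame_loss(sr_prev, sr_curr, hr_prev, hr_curr,
                        w_spatial=1.0, w_temporal=0.5):
        """Spatial fidelity on both super-resolved frames plus a
        temporal-consistency penalty on their frame differences."""
        spatial = charbonnier(sr_prev - hr_prev) + charbonnier(sr_curr - hr_curr)
        temporal = charbonnier((sr_curr - sr_prev) - (hr_curr - hr_prev))
        return w_spatial * spatial + w_temporal * temporal

    if __name__ == "__main__":
        # Toy check with random tensors shaped (batch, channels, height, width).
        hr_prev, hr_curr = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
        sr_prev, sr_curr = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
        print(dual_frame_loss(sr_prev, sr_curr, hr_prev, hr_curr).item())

In this toy form, the temporal term only penalizes mismatched motion between consecutive frames; the actual ST-RRED formulation measures spatial and temporal entropic differences aligned with human perception.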