The spectral line confocal sensor offers significant advantages, including submicron-level resolution, fast scanning imaging, multi-parameter measurement, and non-contact operation, and is widely used in industries such as aerospace, military, semiconductors, and new energy. However, maintaining consistent resolution across the sensor's extensive axial measurement range remains challenging because of light-field degradation in the images captured by the area-array CMOS sensor. The resulting non-uniform resolution, particularly at the edges of the range, leads to data distortion and error accumulation, degrading the system's overall accuracy and reliability. To address this limitation, this paper proposes a network architecture that integrates a Vision Transformer (ViT) with a perpendicular-attention parallel CNN, employing deep learning to restore the images captured by the sensor. This approach combines local and global information, extracting image features at multiple scales and enhancing the model's ability to capture information along different directions in the image. As a result, the sensor achieves highly uniform signal quality, enabling accurate 3D reconstruction. The effectiveness of the proposed method is experimentally validated through measurements of periodic structures on wafer and semiconductor samples.
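The core directional idea behind the perpendicular-attention branch can be illustrated with a minimal sketch: self-attention applied independently along the two spatial axes of a feature map, then fused in parallel. The NumPy code below is an assumption-laden illustration only; the function names, the (H, W, C) layout, and the additive fusion are hypothetical and not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """Scaled dot-product self-attention along one spatial axis
    of an (H, W, C) feature map (axis=0: vertical, axis=1: horizontal)."""
    if axis == 0:
        x = x.transpose(1, 0, 2)                          # (W, H, C): attend over H
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    out = softmax(scores) @ x                             # attention-weighted sum
    if axis == 0:
        out = out.transpose(1, 0, 2)                      # restore (H, W, C)
    return out

def perpendicular_attention(x):
    # fuse the two perpendicular attention paths (additive fusion assumed here)
    return axis_attention(x, 0) + axis_attention(x, 1)
```

Applying attention along each axis separately keeps the cost linear in the other dimension while still letting each pixel aggregate context across a full row and a full column, which is one plausible way to "capture information in various directions."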