With the proliferation of digital imaging devices, photographing sensitive information displayed on screens with mobile phones and cameras has become a prominent channel for modern data leaks. To trace the origin of such information breaches, Screen-Shooting Resistant Watermarking (SSRW) has attracted considerable attention. Most existing solutions rely on Convolutional Neural Networks (CNNs) to embed watermarks. However, because of their limited receptive field, CNNs excel at extracting local features but struggle to capture global context across the entire image. This paper presents a new screen-shooting resistant watermarking system that uses multi-head and cross-attention to embed watermarks, replacing the encoder in the end-to-end architecture. Specifically, we segment the image and the watermark into small patches and apply positional embedding; we then compute attention scores through multi-head attention layers and generate the encoded image by concatenating the attended features. This design improves the model's ability to reason about the entire image, thereby improving performance. We also enhance a U-Net structure to replace the end-to-end decoder. Experimental results demonstrate that the proposed method not only achieves extraction accuracy above 95% across different capture scenarios but also outperforms current state-of-the-art (SOTA) methods in robustness and invisibility. Moreover, it attains average PSNR and SSIM values of 41.90 dB and 0.99, confirming the excellent visual quality of the watermarked images.
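The embedding pipeline sketched in the abstract (split image and watermark into patches, add positional embeddings, fuse them with multi-head cross-attention, concatenate the heads) can be illustrated with a minimal NumPy toy. This is an assumption-laden sketch, not the paper's actual architecture: random projection matrices stand in for learned weights, and all dimensions, patch sizes, and function names are hypothetical.

```python
import numpy as np

def patchify(x, p):
    # Split an (H, W, C) array into non-overlapping p x p patches,
    # flattened to a (num_patches, p*p*C) token matrix.
    H, W, C = x.shape
    patches = x.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, d, num_heads, rng):
    # Image tokens act as queries; watermark tokens supply keys/values.
    # Random Gaussian projections stand in for learned Wq/Wk/Wv weights.
    n_q = q_tokens.shape[0]
    dh = d // num_heads
    Wq = rng.standard_normal((q_tokens.shape[1], d)) / np.sqrt(q_tokens.shape[1])
    Wk = rng.standard_normal((kv_tokens.shape[1], d)) / np.sqrt(kv_tokens.shape[1])
    Wv = rng.standard_normal((kv_tokens.shape[1], d)) / np.sqrt(kv_tokens.shape[1])
    Q = (q_tokens @ Wq).reshape(n_q, num_heads, dh)
    K = (kv_tokens @ Wk).reshape(-1, num_heads, dh)
    V = (kv_tokens @ Wv).reshape(-1, num_heads, dh)
    out = np.empty_like(Q)
    for h in range(num_heads):
        # Scaled dot-product attention per head.
        scores = softmax(Q[:, h] @ K[:, h].T / np.sqrt(dh))
        out[:, h] = scores @ V[:, h]
    # Concatenate the heads into one fused feature per image patch.
    return out.reshape(n_q, d)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))      # hypothetical cover image
watermark = rng.random((8, 8, 1))    # hypothetical watermark map
img_tokens = patchify(image, 8)      # (16, 192)
wm_tokens = patchify(watermark, 4)   # (4, 16)
# Random stand-ins for learned positional embeddings.
img_tokens = img_tokens + 0.02 * rng.standard_normal(img_tokens.shape)
wm_tokens = wm_tokens + 0.02 * rng.standard_normal(wm_tokens.shape)
fused = cross_attention(img_tokens, wm_tokens, d=64, num_heads=4, rng=rng)
print(fused.shape)  # one fused feature vector per image patch
```

In a real encoder the fused tokens would be projected back to pixel space and folded into the cover image; here the sketch stops at the attention output to keep the mechanism visible.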