Deep learning in medical imaging has become a focal point of research, with an emphasis on techniques that automatically detect pathologies in magnetic resonance imaging (MRI) and X-ray scans. Beyond these imaging modalities, there is a recognized need to apply soft computing solutions to Laryngeal Video Endoscopy (LVE), specifically in the context of videostroboscopy. While this technique visualizes the oscillations of the vocal folds, manual examination of the glottis presents inherent challenges. In the current literature, various proposals favor segmentation models such as U-Net, trained with labeled data, to address this diagnostic challenge. Fortunately, the Benchmark for Automatic Glottis Segmentation (BAGLS) dataset, comprising 55,750 meticulously labeled images, has become available. However, training a U-Net model on BAGLS requires a substantial investment of computational resources. Therefore, this study proposes an approach to reduce computational costs by designing a lightweight U-Net built on depthwise separable convolutions. In addition to minimizing computational demands, this study enhances the compact architecture with attention gates, skip connections, and squeeze-and-excitation blocks. These modifications extract robust features while avoiding unnecessary growth in the parameter count. Despite its compact nature, the proposed S3AR U-Net includes a feature similarity module that enriches its spectrum of features. Overall, the S3AR U-Net introduced in this study proves to be a cost-efficient solution, requiring only 62 K parameters and 0.21 G FLOPs. Impressively, it achieves a Dice similarity coefficient (DSC) of 88.73% and an intersection over union (IoU) of 79.97%. These performance metrics position the S3AR U-Net favorably against most existing U-Net solutions, showcasing its efficacy in automatically segmenting the glottis in medical image analysis.
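
As an illustrative aside (not part of the paper's own implementation), the following minimal PyTorch sketch shows why depthwise separable convolutions cut parameter counts relative to a standard convolution, which is the core mechanism behind the lightweight design described above. The module name, channel sizes, and kernel size here are assumptions chosen for illustration, not values taken from the S3AR U-Net.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A depthwise convolution (one 3x3 filter per input channel, via
    groups=in_ch) followed by a 1x1 pointwise convolution that mixes
    channels. This factorization replaces a dense k x k convolution."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, padding: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# Hypothetical channel sizes for comparison only.
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
separable = DepthwiseSeparableConv(64, 128)

print(count_params(standard))   # 64 * 128 * 3 * 3          = 73,728
print(count_params(separable))  # 64 * 3 * 3 + 64 * 128     =  8,768  (~8.4x fewer)
```

For a k x k convolution, the factorization reduces the parameter cost from in_ch * out_ch * k^2 to in_ch * k^2 + in_ch * out_ch, which is how architectures of this kind stay in the tens of thousands of parameters rather than millions.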