To address the strong similarity and blurred boundaries between lesions and normal tissue that are common in medical images, we propose TransUMobileNet, a model with a symmetric encoder-decoder structure. First, the feature encoder adopts a hybrid CNN-Transformer architecture in which the Transformer encodes tokenized image patches from convolutional neural network (CNN) feature maps as an input sequence for global context extraction. The Transformer's self-attention over this token sequence strengthens the encoding of long-range dependencies and representation learning, enriching the global information the encoder captures. Second, the feature decoder is fully symmetric to the encoder: symmetric skip connections compensate for the positional information lost along the Transformer path and improve the delineation of target boundaries, while cascaded upsampling restores local spatial information and recovers finer detail. In addition, a Multi-Channel Attention Fusion (MCAF) module is embedded in the decoding path. Its narrow-wide-narrow channel structure (small channel counts at both ends and a large one in the middle), combined with an attention mechanism, enriches the fused feature information and automatically reweights key regions, sharpening the focus on target areas. TransUMobileNet was evaluated on three public medical image segmentation datasets and a custom thyroid nodule segmentation dataset. It achieves a recall of 82.23% and a mean average precision of 95.62%, outperforming current mainstream medical image segmentation methods.
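
To make the hybrid encoder concrete, the following is a minimal PyTorch sketch of the idea described above: CNN feature maps are flattened into patch tokens and passed through a Transformer encoder for global context. The CNN stem, all dimensions (`embed_dim`, `depth`, `heads`), and the omission of positional embeddings are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    """CNN stem -> patch tokens -> Transformer, sketching the hybrid encoder."""
    def __init__(self, in_ch=3, embed_dim=256, depth=4, heads=8):
        super().__init__()
        # Lightweight CNN stem standing in for the MobileNet-style backbone;
        # two stride-2 convolutions downsample the input by a factor of 4.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Standard Transformer encoder (positional embeddings omitted for brevity).
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        f = self.cnn(x)                        # (B, C, H, W) local CNN features
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, C) patch token sequence
        tokens = self.transformer(tokens)      # self-attention adds global context
        return tokens.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(1, 3, 224, 224)
print(HybridEncoder()(x).shape)  # torch.Size([1, 256, 56, 56])
```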
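
The symmetric decoding path can be sketched in the same spirit. Below is one hypothetical cascaded-upsampling stage that concatenates a skip connection from the mirrored encoder level before refining with convolutions; channel counts and block composition are placeholder choices, not the published decoder.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One cascaded-upsampling decoder stage with a symmetric skip connection."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                    # restore local spatial resolution
        x = torch.cat([x, skip], dim=1)   # skip path reinjects positional detail
        return self.conv(x)

x = torch.randn(1, 256, 28, 28)      # deep features from the encoder
skip = torch.randn(1, 128, 56, 56)   # mirrored encoder feature map
print(UpBlock(256, 128, 128)(x, skip).shape)  # torch.Size([1, 128, 56, 56])
```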
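
Finally, a rough sketch of how an MCAF-style block might look, assuming an inverted-bottleneck (narrow-wide-narrow) channel expansion plus a squeeze-and-excitation gate for the attention mechanism; the paper's actual MCAF design and its form of attention may differ.

```python
import torch
import torch.nn as nn

class MCAFSketch(nn.Module):
    """Narrow-wide-narrow fusion block gated by channel attention."""
    def __init__(self, ch, expand=4, reduction=8):
        super().__init__()
        mid = ch * expand  # wide middle, narrow ends
        self.fuse = nn.Sequential(
            nn.Conv2d(ch, mid, 1), nn.ReLU(inplace=True),   # expand channels
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid),  # depthwise spatial mixing
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, ch, 1),                          # project back down
        )
        self.attn = nn.Sequential(  # squeeze-and-excitation style gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        f = self.fuse(x)
        # The gate emphasizes informative channels; the residual keeps detail.
        return x + f * self.attn(f)

print(MCAFSketch(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```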