In the field of medical image segmentation, UNet has emerged as a widely used backbone architecture. Advances in deep learning techniques such as convolutional neural networks (CNNs), attention mechanisms, and Transformers have provided a foundation for building newer and more powerful variants of UNet. Pure CNN-based UNets have demonstrated excellent performance in medical image segmentation, and more recently, pure Transformer-based UNets have achieved even better segmentation results. Owing to their local inductive bias, CNNs excel at capturing local features and produce fine-grained but potentially incomplete results, whereas Transformers excel at capturing global context and produce complete but less detailed results. Several recent studies have explored integrating CNNs and Transformers, achieving promising performance. In this paper, we introduce a novel dual-encoder architecture that combines Swin Transformer and CNNs. Unlike prior methods, our architecture comprises two distinct encoder branches: one based on Swin Transformer and the other on CNNs. Furthermore, a spatial-channel attention-based fusion (SCAF) module is designed to effectively fuse their outputs. These designs enable our network to capture both global context and local textural detail, thereby improving medical image segmentation performance. Experimental results demonstrate that the proposed method outperforms previous state-of-the-art methods on both the Synapse multi-organ CT dataset and the ACDC dataset.
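For concreteness, the sketch below illustrates the dual-encoder fusion pattern in PyTorch: same-resolution feature maps from a CNN branch and a Swin Transformer branch are combined by an attention-based fusion block. The internals of the SCAF block shown here (SE-style channel attention followed by a 7x7 spatial attention, in the spirit of CBAM) are illustrative assumptions, not the paper's actual design, which this section does not specify.

```python
import torch
import torch.nn as nn

class SCAF(nn.Module):
    """Hypothetical spatial-channel attention fusion of two feature maps.

    Assumption: channel attention (squeeze-and-excitation style) followed by
    spatial attention, applied to the concatenated CNN and Swin features.
    """
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Channel attention over the concatenated (2C-channel) features.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention from per-position mean/max channel statistics.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(2 * channels, channels, 1)  # back to C channels

    def forward(self, f_cnn, f_swin):
        x = torch.cat([f_cnn, f_swin], dim=1)           # (B, 2C, H, W)
        x = x * self.channel_att(x)                     # reweight channels
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1) # (B, 2, H, W)
        x = x * self.spatial_att(s)                     # reweight positions
        return self.project(x)                          # (B, C, H, W)

# Usage: fuse same-resolution features from the two encoder branches.
f_cnn = torch.randn(2, 96, 56, 56)    # features from the CNN encoder
f_swin = torch.randn(2, 96, 56, 56)   # features from the Swin encoder
fused = SCAF(channels=96)(f_cnn, f_swin)
print(fused.shape)                    # torch.Size([2, 96, 56, 56])
```

One such fusion block would typically sit at each encoder stage, so the decoder receives features that carry both the CNN branch's local detail and the Swin branch's global context.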