In the computer vision domain, image super-resolution (SR), which restores high-resolution detail from low-resolution images, plays a vital role in practical applications such as medical imaging, public safety, and remote sensing. Traditional methods employ convolutional neural networks (CNNs) for this task, while Vision Transformers have shown strong performance on high-level vision tasks. However, compared with typical CNN architectures, Vision Transformers make weaker use of the high-frequency information in images, leading to blurred details and residual artifacts. To address this issue, we adopt a hierarchical network structure that gives our approach a more flexible receptive field. First, our method recovers lost spatial features with a Convolutional Swin Transformer Layer that incorporates a Convolutional Feed-Forward Network, restoring missing spatial information and enhancing the model's representational capacity. Next, multiple such layers are combined into a Residual Convolutional Swin Transformer Block for deep feature extraction. Finally, a hierarchical structure fuses the features of each branch. Experiments validate that the proposed method generates images with richer detail better aligned with human perception, is effective on SR tasks at magnification factors of 2, 3, and 4, and reconstructs clear, complete edge structures. We provide code at https://github.com/Q88392/HCT.
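To make the Convolutional Feed-Forward Network idea concrete, the sketch below shows one common way to inject local spatial information into a transformer's feed-forward path: a depthwise convolution inserted between the two linear projections. This is a minimal illustration under that assumption; the module name, hyperparameters, and layer arrangement are illustrative and not taken from the paper's actual implementation.

```python
# Minimal sketch of a Convolutional Feed-Forward Network (ConvFFN).
# Assumption: it follows the common pattern of placing a depthwise
# convolution between the two linear projections of a transformer FFN
# to recover local spatial features; details here are illustrative.
import torch
import torch.nn as nn


class ConvFFN(nn.Module):
    def __init__(self, dim: int, hidden_ratio: int = 4):
        super().__init__()
        hidden = dim * hidden_ratio
        self.fc1 = nn.Linear(dim, hidden)   # channel expansion
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)  # depthwise conv adds locality
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)   # channel reduction

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, h*w, dim) token sequence from the Swin attention block
        b, n, c = x.shape
        x = self.fc1(x)
        # fold tokens back into a 2D feature map for the depthwise conv
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        x = self.act(self.dwconv(x))
        x = x.flatten(2).transpose(1, 2)    # back to (b, n, hidden)
        return self.fc2(x)


# Usage: tokens from an 8x8 window with 96 channels
tokens = torch.randn(1, 64, 96)
out = ConvFFN(96)(tokens, h=8, w=8)
print(out.shape)  # torch.Size([1, 64, 96])
```

The design intuition is that window attention mixes tokens globally within each window but is weak at capturing fine local (high-frequency) structure, so a lightweight depthwise convolution in the FFN can reintroduce that local detail at negligible parameter cost.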