As global urbanization accelerates, accurate extraction of urban impervious surfaces has become crucial for urban planning, resource management, and environmental protection. However, existing methods struggle with high-resolution remote sensing images, especially in distinguishing impervious surfaces from regions with complex building structures and similar texture features. These difficulties often lead to omissions, misclassification, and blurred boundaries, which reduce the accuracy and reliability of segmentation results. To address these issues, this study proposes CSW-DeepLabV3+, an improved model based on DeepLabV3+. The model incorporates the cross-shaped window transformer architecture to exchange information and integrate global features across different scales, improving performance in complex scenes. A coordinate attention mechanism is also employed to preserve global information while improving the handling of fine details. Experimental results show that CSW-DeepLabV3+ not only significantly outperforms existing models such as DeepLabV3+, FCN-8s, U-Net, and PSPNet in extraction accuracy but also excels at capturing global information and preserving edge clarity. Its mIoU, OA, Precision, Recall, and F1 scores reached 92.99%, 92.76%, 95.24%, 94.57%, and 94.87%, respectively. The model also demonstrated excellent computational efficiency and inference speed, providing a more precise tool for urban spatial planning and optimization. Finally, the model was applied to a spatiotemporal analysis of impervious surfaces in the Greater Bay Area from 2017 to 2023. The results reveal a continuous expansion of impervious surfaces during this period, concentrated mainly in the urban core and surrounding areas, with significant development toward the southeast.
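For readers unfamiliar with the reported metrics, the following is a minimal sketch of how mIoU, OA, Precision, Recall, and F1 are commonly computed for a binary impervious/background mask. It is an illustration under stated assumptions, not the evaluation code used in this study: the function name `segmentation_metrics`, the label convention (1 = impervious, 0 = background), and the choice to average IoU over the two classes are all hypothetical.

```python
import numpy as np


def segmentation_metrics(y_true, y_pred):
    """Common binary segmentation metrics.

    Assumption: label 1 = impervious surface, label 0 = background.
    """
    y_true = np.asarray(y_true).astype(bool).ravel()
    y_pred = np.asarray(y_pred).astype(bool).ravel()

    # Confusion-matrix counts for the impervious (positive) class.
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero on degenerate inputs.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    oa = ratio(tp + tn, tp + tn + fp + fn)  # overall accuracy

    # Per-class IoU, averaged over the two classes (assumed mIoU definition).
    iou_impervious = ratio(tp, tp + fp + fn)
    iou_background = ratio(tn, tn + fp + fn)
    miou = (iou_impervious + iou_background) / 2

    return {
        "mIoU": miou,
        "OA": oa,
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }


if __name__ == "__main__":
    # Toy 2x2 masks, for illustration only (not data from the study).
    truth = np.array([[1, 1], [0, 0]])
    pred = np.array([[1, 0], [0, 0]])
    for name, value in segmentation_metrics(truth, pred).items():
        print(f"{name}: {value:.4f}")
```

In practice these counts would be accumulated over all test tiles before taking the ratios, so that large and small images contribute in proportion to their pixel counts.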