Accurate and complete extraction of buildings from very high-resolution (VHR) remote sensing (RS) images is highly important for urban planning and land management. However, automatic building extraction from VHR images remains difficult owing to the limited information available for small buildings and building boundaries, as well as the spectral similarity of ground objects, tree occlusion, and shadow interference. These issues may result in extraction errors such as misclassification, the omission of small buildings, blurred boundaries, and incorrect segmentation. To address these challenges, we propose a multiscale hybrid transformer (MSHFormer) with boundary enhancement. This approach incorporates a hybrid encoder that combines a multiscale local perception (MSLP) module and a global perception module (GPM), uniting the strengths of convolutional neural networks (CNNs) and transformers to achieve an efficient synergy between global modeling and local feature extraction. In addition, we develop an edge enhancement module (EHM) to strengthen boundary information, significantly improving building boundary segmentation accuracy. Finally, we design a group alignment feature fusion module (GAFFM) to efficiently integrate low-level features from the encoder with high-level features from the decoder, reducing misalignment in the feature space. Experimental results on three public datasets demonstrate the effectiveness of MSHFormer. Specifically, the proposed method achieves intersection-over-union (IoU) values of 89.1%, 73.6%, and 89.5% on the Potsdam, Massachusetts, and WHU datasets, respectively.
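The core idea of the hybrid encoder, pairing a convolutional local branch with an attention-based global branch and fusing the two, can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's actual MSLP/GPM modules: tokens are a 1-D sequence, the local branch is a fixed 1-D convolution, the global branch is single-head self-attention, and fusion is plain addition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_branch(x, kernel):
    # x: (N, C) token sequence; a 1-D convolution over the token
    # axis models local context, analogous to a CNN branch.
    k, pad = len(kernel), len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        out[i] = sum(kernel[j] * xp[i + j] for j in range(k))
    return out

def global_branch(x, Wq, Wk, Wv):
    # Single-head self-attention: every token attends to all
    # tokens, capturing global context as a transformer branch does.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return attn @ v

def hybrid_block(x, kernel, Wq, Wk, Wv):
    # Fuse local (CNN-like) and global (transformer-like) features;
    # real hybrid encoders typically use learned, multiscale fusion.
    return local_branch(x, kernel) + global_branch(x, Wq, Wk, Wv)
```

The output keeps the input shape `(N, C)`, so such blocks can be stacked; the actual model additionally operates at multiple scales and enhances boundary features downstream.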