Hyperspectral image (HSI) classification is a critical task in remote sensing, but existing U-Net and transformer-based models face significant challenges. Traditional U-Net architectures struggle with multi-scale feature extraction because their fixed convolutional kernels limit their ability to capture complex spatial distributions. Transformer models, while adept at capturing global context, suffer from high computational complexity and inadequate sensitivity to local features in HSI. To address these limitations, we propose a novel network that joins U-Nets with a hierarchical graph structure and a sparse transformer (HGSTNet). HGSTNet introduces hierarchical graph merging blocks and incremental merging methods to dynamically extract and fuse multi-scale features, leveraging superpixel segmentation and hierarchical graph topology to strengthen spatial correlation. Furthermore, to improve the model's global context perception, we integrate a sparse transformer block into the first four encoder-decoder stages. Unlike traditional transformers, the sparse transformer reduces computational complexity and enhances feature capture through a sparse self-attention (SPA) module, which suppresses low-relevance or redundant information and thereby improves the capture of both local and global features. Experiments on multiple HSI datasets, along with comparisons to other deep learning methods, demonstrate that HGSTNet is strongly competitive.
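To make the sparse self-attention idea concrete, below is a minimal numpy sketch of one common realization: top-k sparse attention, where each query attends only to its k highest-scoring keys and all other attention scores are masked out before the softmax. The abstract does not specify how the SPA module selects relevant entries, so the function name, the top-k rule, and the tensor shapes here are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def sparse_self_attention(Q, K, V, k):
    """Top-k sparse self-attention sketch (assumed mechanism):
    each query attends only to its k highest-scoring keys; the
    remaining scores are masked to -inf before the softmax,
    suppressing low-relevance or redundant attention weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) scaled similarities
    # Keep only the top-k entries in each row; mask the rest.
    kth = np.sort(scores, axis=-1)[:, -k][:, None]   # k-th largest score per row
    masked = np.where(scores >= kth, scores, -np.inf)
    # Numerically stable softmax over the surviving entries.
    masked -= masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))                      # 6 tokens, feature dim 8
out = sparse_self_attention(X, X, X, k=2)
print(out.shape)                                     # (6, 8)
```

Because the masked scores are set to -inf, their softmax weights are exactly zero, so each output token is a convex combination of at most k value vectors; this is the source of both the computational savings and the suppression of redundant context that the abstract attributes to the SPA module.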