The advancement of wireless communication toward beyond fifth-generation (B5G) and sixth-generation (6G) standards demands intelligent and scalable signal processing capable of accurate modulation classification under dynamic and noisy channel conditions. Automatic modulation classification (AMC) is essential for spectrum awareness, interference mitigation, and autonomous decision-making. While deep learning (DL) models have improved AMC performance, many conventional architectures face limitations due to sequential processing, high complexity, and reduced robustness in low signal-to-noise ratio (SNR) environments. To address these limitations, this paper presents a novel attention-enhanced hybrid AMC framework that integrates specialized convolutional layers for efficient temporal feature extraction with a compact transformer encoder for global sequence modeling. The proposed architecture is tailored to capture the amplitude-phase dynamics inherent in in-phase/quadrature (I/Q) modulated signals while maintaining lightweight design principles. Furthermore, the proposed architecture employs state-of-the-art computational optimizations, including gradient scaling, mixed-precision training, and adaptive moment estimation with decoupled weight decay (AdamW), to enhance energy efficiency and accelerate convergence. Unlike conventional DL architectures, the proposed technique introduces an efficient parallelized sequence-processing strategy that significantly reduces training complexity while preserving robust adaptability across diverse conditions. In contrast to existing hybrid or purely sequential architectures, this design attains high classification fidelity without the need for complex graph-based structures or stacked attention mechanisms, thereby enhancing both model interpretability and practical deployment feasibility.
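The global sequence modeling performed by the compact transformer encoder rests on scaled dot-product self-attention. The following is a minimal NumPy sketch of that core operation, not the paper's implementation: the sequence here stands in for temporal features that a CNN front end might extract from raw I/Q samples, and the weight matrices, dimensions, and seed are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a (T, d) sequence.

    X could be a sequence of temporal features derived from I/Q samples;
    Wq, Wk, Wv are illustrative projection matrices, not trained weights.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    A = softmax((Q @ K.T) / np.sqrt(d_k), axis=-1)  # (T, T); each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
T, d = 8, 4  # toy sequence length and model width
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (0.5 * rng.standard_normal((d, d)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Because every sequence position attends to every other position in a single matrix product, the whole sequence is processed in parallel, which is the property the text contrasts with the step-by-step recurrence of RNN-style models.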
Extensive benchmarking is conducted on the RadioML 2018.01A dataset, which comprises 24 modulation types across a wide range of SNR levels and simulates real-world impairments such as fading, oscillator drift, and noise. The proposed framework is rigorously compared against six representative models: recurrent neural networks (RNN), long short-term memory (LSTM), gated recurrent units (GRU), the convolutional neural network-transformer graph neural network (CTGNet), MobileViT, and DeepsigNet, across multiple evaluation criteria. Experimental results consistently validate the effectiveness of the proposed framework, which surpasses leading architectures in accuracy under both static and dynamic (i.e., Jakes-fading) channel conditions. The model achieves an average accuracy improvement of 14.86% over benchmark methods and delivers up to a 38% reduction in inference latency. The study further includes comprehensive ablation analysis, statistical validation, and computational benchmarking, confirming the proposed model as a robust, efficient, and scalable solution for next-generation AMC applications.
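The dynamic channel conditions referenced above follow the classical Jakes Doppler model, in which Rayleigh fading is generated as a sum of sinusoids with Doppler shifts spread over arrival angles. Below is a minimal sketch of that model under stated assumptions: the path count, normalized Doppler frequency, and seed are illustrative choices, not parameters taken from the paper's evaluation.

```python
import numpy as np

def jakes_fading(num_samples, fd_norm, num_paths=64, seed=0):
    """Sum-of-sinusoids Rayleigh fading gains (Jakes Doppler model).

    fd_norm is the maximum Doppler shift normalized by the sample rate.
    The path count and seed are illustrative, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(num_samples)
    # Arrival angles spread around the circle with random offsets and phases.
    alpha = 2.0 * np.pi * (np.arange(num_paths) + rng.random(num_paths)) / num_paths
    phi = 2.0 * np.pi * rng.random(num_paths)
    doppler = fd_norm * np.cos(alpha)                  # per-path Doppler shift
    phases = 2.0 * np.pi * np.outer(t, doppler) + phi  # (num_samples, num_paths)
    # Normalize so the fading process has (approximately) unit average power.
    return np.exp(1j * phases).sum(axis=1) / np.sqrt(num_paths)

h = jakes_fading(num_samples=4096, fd_norm=0.01)
faded = h * np.ones(4096, dtype=complex)  # apply to a toy unit-power signal
```

Multiplying a modulated I/Q sequence by such a gain process (plus additive noise) yields the kind of time-varying channel under which the dynamic-condition accuracy comparisons are made.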