Facial Expression Recognition (FER) faces significant challenges, primarily due to large intra-class variations, subtle inter-class differences, and limited dataset sizes. Real-world factors such as pose, illumination, and partial occlusion further hinder FER performance. To tackle these challenges, multi-scale and attention-based networks have been widely employed. However, previous approaches have primarily focused on increasing depth while neglecting width, resulting in an inadequate representation of fine-grained facial expression features. This study introduces a novel FER model, a multi-scale attention network (MSA-Net), designed as a wider and deeper network that captures features from various receptive fields through a parallel network structure. Each parallel branch utilizes channel-complementary multi-scale blocks, namely left multi-scale (MS-L) and right multi-scale (MS-R), to broaden the effective receptive field and capture diverse features. Additionally, attention networks are employed to emphasize important regions and boost the discriminative capability of the multi-scale features. The performance of the proposed method was evaluated on two popular real-world FER databases: AffectNet and RAF-DB. MSA-Net reduces the impact of pose and partial occlusion, as well as the network's susceptibility to subtle expression-related variations, thereby outperforming other methods in FER.
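The overall idea of combining parallel multi-scale branches with channel attention can be illustrated with a small sketch. This is not the paper's actual implementation: the branch names MS-L and MS-R come from the abstract, but the use of plain average filters (standing in for learned convolutions), the kernel sizes 3 and 5, and the SE-style sigmoid channel gate are all assumptions made for illustration.

```python
import numpy as np

def avg_conv(x, k):
    """Naive 'same' average filter of size k x k.

    A stand-in for a learned convolution; kernel size controls
    the receptive field of the branch (an assumption, not the
    paper's actual operator).
    """
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    h, w, _ = x.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].mean(axis=(0, 1))
    return out

def msa_block(x):
    """Toy multi-scale attention block on an (H, W, C) feature map.

    Two parallel branches with different receptive fields widen the
    network (channel concatenation), then a channel-attention gate
    re-weights the concatenated features.
    """
    ms_l = avg_conv(x, 3)  # MS-L branch: smaller receptive field (assumed k=3)
    ms_r = avg_conv(x, 5)  # MS-R branch: larger receptive field (assumed k=5)
    feat = np.concatenate([ms_l, ms_r], axis=-1)  # widen, not just deepen
    squeeze = feat.mean(axis=(0, 1))              # global average pool per channel
    attn = 1.0 / (1.0 + np.exp(-squeeze))         # sigmoid gate (SE-style, assumed)
    return feat * attn, attn
```

Running the block on an 8x8 map with 4 channels yields an 8x8 map with 8 channels, each scaled by its attention weight; in the real network the branches would be learned convolutions and the attention would include spatial as well as channel components.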