In recent years, research on infrared and visible image fusion has mainly focused on deep learning-based approaches, particularly deep neural networks with auto-encoder architectures. However, these approaches suffer from problems such as insufficient feature extraction capability and inefficient fusion strategies. Therefore, this paper introduces a novel image fusion network to address the limitations of infrared and visible image fusion networks with auto-encoder architectures. In the designed network, the encoder employs a multi-branch cascade structure in which convolution branches with different kernel sizes provide an adaptive receptive field for extracting multi-scale features. In addition, the fusion layer incorporates a non-local attention module inspired by the self-attention mechanism. With its global receptive field, this module is used to build a non-local attention fusion network, which works together with the $l_1$-norm spatial fusion strategy to extract, split, filter, and fuse global and local features. Comparative experiments on the TNO and MSRS datasets demonstrate that the proposed method outperforms other state-of-the-art fusion approaches.
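As one plausible reading of the multi-branch cascade encoder, the minimal sketch below uses parallel convolution branches with different kernel sizes whose outputs are concatenated and merged; the channel counts, the kernel sizes (3, 5, 7), and the 1x1 merging convolution are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiBranchEncoder(nn.Module):
    """Illustrative multi-branch encoder: parallel convolution branches with
    different kernel sizes give an adaptive receptive field for extracting
    multi-scale features. All layer sizes here are assumptions."""

    def __init__(self, in_ch=1, branch_ch=16, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, k, padding=k // 2),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        ])
        # A 1x1 convolution merges the concatenated multi-scale features.
        self.fuse = nn.Conv2d(branch_ch * len(kernel_sizes), branch_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]  # one scale per branch
        return self.fuse(torch.cat(feats, dim=1))
```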
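The non-local attention module can be sketched as a standard non-local block in the embedded-Gaussian form (Wang et al.'s non-local neural networks), one common realization of self-attention over spatial positions with a global receptive field; the paper's actual module may differ in detail.

```python
class NonLocalAttention(nn.Module):
    """Standard non-local block sketch: every spatial position attends to
    every other position, giving a global receptive field."""

    def __init__(self, channels):
        super().__init__()
        inter = max(channels // 2, 1)
        self.theta = nn.Conv2d(channels, inter, 1)  # query projection
        self.phi = nn.Conv2d(channels, inter, 1)    # key projection
        self.g = nn.Conv2d(channels, inter, 1)      # value projection
        self.out = nn.Conv2d(inter, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (b, hw, c')
        k = self.phi(x).flatten(2)                    # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)      # (b, hw, c')
        attn = torch.softmax(q @ k, dim=-1)           # global pairwise affinity
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                        # residual connection
```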
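The $l_1$-norm spatial fusion strategy is commonly implemented as below (e.g., in DenseFuse-style fusion): each source's channel-wise $l_1$ activity map is normalized into a per-pixel weight, and the encoder features are blended accordingly. This is a sketch of the common formulation; the exact weighting used in this paper may differ.

```python
def l1_norm_spatial_fusion(feat_ir, feat_vis, eps=1e-8):
    """Blend infrared and visible encoder features with per-pixel weights
    derived from channel-wise l1-norm activity maps (common formulation;
    details may differ from the paper's strategy)."""
    act_ir = feat_ir.abs().sum(dim=1, keepdim=True)    # activity map, infrared
    act_vis = feat_vis.abs().sum(dim=1, keepdim=True)  # activity map, visible
    w_ir = act_ir / (act_ir + act_vis + eps)           # normalized weight
    return w_ir * feat_ir + (1.0 - w_ir) * feat_vis
```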