Due to the physical limitations of imaging sensors, obtaining a single medical image that simultaneously captures functional metabolic information and structural tissue detail remains a significant challenge in clinical diagnosis. To address this, Multimodal Medical Image Fusion (MMIF) has emerged as an effective technique for integrating complementary information from multimodal source images, such as CT, PET, and SPECT, which is critical for a comprehensive understanding of both the anatomical and functional aspects of the human body. A key challenge in MMIF is how to exchange and aggregate this multimodal information. This article rethinks MMIF from the perspective of harmonizing modality gaps and proposes a novel Modality-Aware Interaction Network (MAINet), which leverages cross-modal feature interaction and progressively fuses hierarchical features in graph space. Specifically, we introduce two key modules: the Cascade Modality Interaction (CMI) module and the Dual-Graph Learning (DGL) module. The CMI module, embedded in a three-branch multi-scale encoder, facilitates complementary multimodal feature learning and feeds information back across branches to enhance discriminative feature learning for each modality. In the decoding stage, the DGL module aggregates hierarchical features in two distinct graph spaces, enabling global feature interaction. Moreover, the DGL module incorporates a bottom-up guidance mechanism in which deeper semantic features guide the learning of shallower detail features, enhancing both scale diversity and modality awareness and yielding fusion results with high visual fidelity. Experimental results on medical image datasets demonstrate the superiority of the proposed method over existing fusion approaches in both subjective and objective evaluations. We also validate the proposed method in related applications, including infrared-visible image fusion and medical image segmentation.
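
To make the described data flow concrete, the following is a minimal PyTorch-style sketch, not the paper's implementation: the module internals (SimpleCMI, SimpleDGL, MAINetSketch), the channel width, the number of graph nodes, and the choice of a spatial-node graph plus a channel graph as the two graph spaces are all illustrative assumptions, and the three-branch encoder is simplified to two modality branches feeding a fused stream. It only illustrates cross-modal interaction in the encoder and graph-space aggregation with bottom-up guidance in the decoder.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleCMI(nn.Module):
    """Toy cross-modal interaction (assumption, not the paper's CMI):
    each branch is refined by a gate computed from both modalities."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, feat_a, feat_b):
        g = self.gate(torch.cat([feat_a, feat_b], dim=1))
        # feedback: each modality is enhanced by the complementary one
        return feat_a + g * feat_b, feat_b + g * feat_a


class SimpleDGL(nn.Module):
    """Toy dual-graph aggregation (assumption, not the paper's DGL):
    graph space 1 treats spatial regions as nodes, graph space 2 treats
    channels as nodes; deeper semantics guide shallower details bottom-up."""
    def __init__(self, channels, nodes=16):
        super().__init__()
        self.assign = nn.Conv2d(channels, nodes, 1)              # pixel-to-node assignment
        self.node_adj = nn.Linear(nodes, nodes, bias=False)      # spatial-node adjacency
        self.channel_adj = nn.Linear(channels, channels, bias=False)  # channel-node adjacency
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, shallow, deep):
        # bottom-up guidance: upsample deep semantic features to the shallow scale
        deep_up = F.interpolate(deep, size=shallow.shape[-2:],
                                mode="bilinear", align_corners=False)
        x = shallow + deep_up
        b, c, h, w = x.shape
        # graph space 1: project pixels onto nodes, mix nodes, project back
        a = torch.softmax(self.assign(x).flatten(2), dim=-1)           # (B, N, HW)
        nodes = a @ x.flatten(2).transpose(1, 2)                       # (B, N, C)
        nodes = self.node_adj(nodes.transpose(1, 2)).transpose(1, 2)   # node mixing
        spatial = (a.transpose(1, 2) @ nodes).transpose(1, 2).reshape(b, c, h, w)
        # graph space 2: channels as nodes, mixed into a channel-wise gate
        gate = torch.sigmoid(self.channel_adj(x.mean(dim=(-2, -1))))   # (B, C)
        return self.out(spatial * gate[..., None, None]) + shallow


class MAINetSketch(nn.Module):
    """Two-scale sketch: modality encoders with CMI at each scale,
    a DGL-based decoder, and a 1x1 head producing the fused image."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc_a1 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.enc_b1 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.enc_a2 = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.enc_b2 = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.cmi1, self.cmi2 = SimpleCMI(ch), SimpleCMI(ch)
        self.dgl = SimpleDGL(ch)
        self.head = nn.Conv2d(ch, 1, 1)

    def forward(self, img_a, img_b):
        a1, b1 = self.cmi1(self.enc_a1(img_a), self.enc_b1(img_b))
        a2, b2 = self.cmi2(self.enc_a2(a1), self.enc_b2(b1))
        fused = self.dgl(a1 + b1, a2 + b2)   # shallow details guided by deep semantics
        return torch.sigmoid(self.head(fused))


if __name__ == "__main__":
    ct = torch.rand(1, 1, 128, 128)    # e.g., a grayscale CT slice
    pet = torch.rand(1, 1, 128, 128)   # e.g., a PET slice (single channel for simplicity)
    print(MAINetSketch()(ct, pet).shape)   # torch.Size([1, 1, 128, 128])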