Objective  Medical image fusion integrates lesion features and complementary information from different modalities, providing a comprehensive and accurate description of medical images for clinical diagnosis. Traditional methods often suffer from reduced contrast and spectral degradation caused by differences between multimodal images. Frequency-domain techniques mitigate these issues but rely on manually designed feature extraction and fusion rules, lacking robustness and adaptability. Deep learning-based fusion methods, such as convolutional neural networks and Transformers, have shown promising results in feature extraction and reconstruction, yet they often overlook the complementary characteristics between modalities, leading to insufficient capture of global information. Although frequency-domain methods preserve high-frequency information, they fail to adequately correlate global and local features, neglecting the unique aspects of each modality and producing excessive smoothing and blurring. This study therefore proposes an adaptive medical image fusion method based on cross-modality perception and spatial-frequency interaction.

Methods  An adaptive medical image fusion network combining cross-modality perception with spatial-frequency interaction was developed. First, a cross-modality perceptual module utilizing channel and coordinate attention mechanisms extracts multiscale deep features and local abnormality information, reducing information loss between modalities. Second, a spatial-frequency cross-fusion module based on frequency-domain information exchange and spatial-domain adaptive cross-fusion was designed. This module alleviates the information imbalance between modalities by exchanging phase information in the frequency domain and dynamically learning global interaction features in the spatial domain, highlighting prominent targets while preserving critical pathological information and texture details. Finally, a loss function comprising content, structure, and spectral terms was designed to further improve the quality of the fused image.
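The phase-exchange step of the spatial-frequency cross-fusion module can be pictured with the minimal sketch below. It assumes PyTorch feature maps and that the exchange keeps each modality's amplitude spectrum while swapping the phase spectra; the function and variable names are illustrative and do not reproduce the authors' implementation.

```python
import torch

def phase_exchange(feat_a: torch.Tensor, feat_b: torch.Tensor):
    """Swap phase spectra between two modality feature maps of shape (B, C, H, W).

    Illustrative sketch: each modality keeps its own amplitude spectrum,
    takes the other's phase, and returns to the spatial domain.
    """
    fft_a = torch.fft.fft2(feat_a)                       # complex spectrum of modality A
    fft_b = torch.fft.fft2(feat_b)                       # complex spectrum of modality B
    amp_a, pha_a = torch.abs(fft_a), torch.angle(fft_a)  # amplitude / phase of A
    amp_b, pha_b = torch.abs(fft_b), torch.angle(fft_b)  # amplitude / phase of B
    exch_a = torch.fft.ifft2(torch.polar(amp_a, pha_b)).real  # A's amplitude + B's phase
    exch_b = torch.fft.ifft2(torch.polar(amp_b, pha_a)).real  # B's amplitude + A's phase
    return exch_a, exch_b
```

The exchanged features would then feed the spatial-domain adaptive cross-fusion (e.g., a cross-attention block) described above.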
Results and Discussions  The fusion experiment for mild Alzheimer's disease demonstrates that the proposed method better preserves positron emission tomography (PET) functional information and the edge details of magnetic resonance imaging (MRI) soft tissue, improving image contrast and detail presentation compared with other methods. The fusion experiment for metastatic bronchogenic carcinoma shows that competing methods suffer from low resolution, blurred textures, and noise interference, hindering observation and diagnosis of the lesion area; in contrast, the proposed method effectively retains single-photon emission computed tomography (SPECT) metabolic information and MRI soft-tissue edge details, enabling doctors to evaluate the lesion comprehensively. The sarcoma fusion experiment further validates the algorithm's effectiveness in preserving tissue edges, grayscale information, and the integrity of density structures. As shown in Tables 1-3, the proposed method performs strongly on the AG (average gradient), MI (mutual information), SF (spatial frequency), Q(AB/F) (fusion quality), CC (correlation coefficient), and VIF (visual information fidelity) indicators. Specifically, the high MI value indicates that the fused image contains rich features and edge information. The high SF value shows that the fused image retains additional global information from the source images, with clear details and texture features. The high VIF value reflects consistency with the visual characteristics of the human eye, while the high Q(AB/F) value indicates that the fused image maintains spatial details consistent with the source images. Compared with other algorithms, the proposed method emphasizes the perception and interaction of structural-image texture contours and functional-image metabolic brightness during feature extraction and fusion, addressing the structural edge loss and lesion detail blurring found in existing fusion methods.

Conclusions  To enhance the quality of multimodal medical image fusion, this study proposes a method combining cross-modality perception with spatial-frequency interaction. During feature extraction, a multiscale cross-modality perception network facilitates the interaction of structural and functional information, fully extracting source image content and enhancing local lesion features. In the fusion stage, key functional and anatomical information is preserved through frequency-domain exchange, followed by cross-attention for adaptive fusion, ensuring that detailed textures and overall edge profiles are fully fused. Additionally, content, structure, and spectral losses were designed to retain complementary and chromatic information. Experimental results demonstrate that the proposed method improves AG, MI, SF, Q(AB/F), CC, and VIF by 4.4%, 13.2%, 2.7%, 3.4%, 11%, and 3%, respectively, showing that it effectively retains the unique information of each modality and produces fusion images with clear edges, rich lesion details, and high visual fidelity. In generalization experiments on abdominal multimodal medical images and on green fluorescent protein and phase-contrast image fusion, the proposed method also demonstrates strong generalization, supporting its potential for application in other biomedical diagnostic tasks and for improving clinicians' diagnostic efficiency.
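As a rough illustration of how the content, structure, and spectral terms named in the Conclusions could be composed, the sketch below assumes PyTorch tensors, a luminance/chrominance split of the pseudo-color functional image, an L1 content term against the element-wise brighter source, a Sobel-gradient structure term, and equal weights; these choices are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

ALPHA, BETA, GAMMA = 1.0, 1.0, 1.0  # hypothetical balance weights

def _gradient(img: torch.Tensor) -> torch.Tensor:
    """Sobel gradient magnitude for single-channel images of shape (B, 1, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx.to(img), padding=1)
    gy = F.conv2d(img, ky.to(img), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def fusion_loss(fused_y, mri_y, func_y, fused_cbcr, func_cbcr):
    """Content + structure + spectral terms (illustrative composition only)."""
    # Content: keep the fused luminance close to the brighter of the two sources.
    content = F.l1_loss(fused_y, torch.maximum(mri_y, func_y))
    # Structure: preserve the stronger edge/texture response of the two sources.
    structure = F.l1_loss(_gradient(fused_y),
                          torch.maximum(_gradient(mri_y), _gradient(func_y)))
    # Spectral: keep chrominance consistent with the pseudo-color functional image.
    spectral = F.l1_loss(fused_cbcr, func_cbcr)
    return ALPHA * content + BETA * structure + GAMMA * spectral
```

In the paper itself, the weights and the exact form of each term (for example, an SSIM-based structure measure) would follow the authors' definitions.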
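For reference, the AG and SF indicators reported in Tables 1-3 have simple pixel-level definitions; the NumPy sketch below follows the standard formulations and is a reference computation rather than code from the paper.

```python
import numpy as np

def average_gradient(img: np.ndarray) -> float:
    """AG: mean local gradient magnitude; higher values indicate sharper detail."""
    gx, gy = np.gradient(img.astype(np.float64))
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

def spatial_frequency(img: np.ndarray) -> float:
    """SF: combined row/column difference energy, reflecting texture richness."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean((img[:, 1:] - img[:, :-1]) ** 2))  # row frequency
    cf = np.sqrt(np.mean((img[1:, :] - img[:-1, :]) ** 2))  # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))
```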