This paper addresses Multi-modal Federated Learning (MFL) over resource-limited Cell-Free massive MIMO (CF-mMIMO) networks for Human Activity Recognition (HAR). MFL leverages the diverse data modalities held by different clients, while the CF-mMIMO network provides the consistent service quality that collaborative training requires. MFL faces two primary challenges. The first is data heterogeneity, comprising statistical and modality heterogeneity, which complicates data fusion, client collaboration, and inference with missing data. The second is system heterogeneity: devices with dissimilar modalities experience different processing and communication delays, which increases overall training latency. To tackle these issues, we propose a late-fusion model architecture that lets clients participate with any combination of data modalities, and we formulate an optimization problem that jointly minimizes latency and global loss in MFL. We further propose a prioritized device-modality selection scheme that allows flexible device participation, and employ a modified Particle Swarm Optimization (PSO) algorithm for efficient resource allocation. Extensive experiments validate our framework, demonstrating substantial reductions in training time and significant improvements in model performance; in particular, test accuracy improves by an average of 15% and 23% over the other fusion models when one and two modalities, respectively, are missing in the inference phase.
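
To illustrate why late fusion tolerates missing modalities, the following minimal sketch fuses per-modality class scores at the decision level by averaging over whichever modalities are present. It is an assumption-laden illustration, not the architecture proposed in this paper: the averaging rule, the function names `softmax` and `late_fusion_predict`, and the modality names in the usage example are all hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of class logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_predict(modality_logits):
    """Average class probabilities over the modalities that are present.

    modality_logits: dict mapping a modality name to its list of class
    logits, or to None when that modality is missing at inference time.
    Returns (predicted_class_index, fused_probabilities).

    Assumption: simple probability averaging stands in for the fusion
    rule; the paper's actual fusion layer may differ.
    """
    available = [softmax(l) for l in modality_logits.values() if l is not None]
    if not available:
        raise ValueError("at least one modality is required")
    n_classes = len(available[0])
    fused = [
        sum(p[c] for p in available) / len(available)
        for c in range(n_classes)
    ]
    return fused.index(max(fused)), fused


# Usage: three hypothetical HAR modalities, one of them missing.
label, probs = late_fusion_predict({
    "accelerometer": [2.0, 0.0, 0.0],
    "gyroscope": None,  # missing modality is simply skipped
    "magnetometer": [0.0, 1.0, 0.0],
})
```

Because each modality produces its own prediction before fusion, dropping a modality only removes one term from the average rather than invalidating a shared input layer, which is the property that allows clients to participate with any subset of modalities.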