Multi-modal semantic communication has attracted great attention due to its broad application prospects. However, existing multi-modal semantic communication systems mostly focus on task-oriented approaches, which ignore the correlation among multi-modal data and are therefore less robust. In this paper, we propose a deep learning enabled semantic communication system with cross-modal alignment, called CA DeepSC, which exploits the correlation across multi-modal signals to enhance the robustness of transmission. First, we train the semantic encoder at the transmitter to learn cross-modal alignment at the semantic level; this alignment allows errors caused by semantic or physical noise to be corrected. Second, we propose a novel cross-modal amendment scheme that dynamically assigns weights to auxiliary modal semantic information based on its correlation level, and integrates the semantic information of each modality with that of the auxiliary modalities at the receiver to improve recovery performance. Finally, experimental results demonstrate that CA DeepSC effectively reduces semantic distortion caused by semantic and physical noise, thereby improving the quality and robustness of multi-modal semantic communication.
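
To make the idea of correlation-based weighting concrete, the following is a minimal illustrative sketch, not the CA DeepSC implementation. It assumes that all modalities have already been mapped into a shared, aligned feature space of equal dimension, that correlation is measured by cosine similarity, and that weights are obtained with a softmax. The function names (`correlation_weights`, `amend`) and the parameters `alpha` and `temperature` are hypothetical and do not come from the paper.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    # Assumed correlation measure between two aligned semantic feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def correlation_weights(target, auxiliaries, temperature=1.0):
    # One weight per auxiliary modality: higher similarity gives a larger weight.
    sims = np.array([cosine_similarity(target, aux) for aux in auxiliaries])
    logits = sims / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()              # softmax: weights are positive and sum to 1

def amend(target, auxiliaries, alpha=0.5, temperature=1.0):
    # Hypothetical fusion rule: blend the received target feature with a
    # correlation-weighted mixture of the auxiliary features.
    weights = correlation_weights(target, auxiliaries, temperature)
    aux_mix = np.sum(weights[:, None] * np.stack(auxiliaries), axis=0)
    fused = (1.0 - alpha) * target + alpha * aux_mix
    return fused, weights

# Usage: a noisy target feature and two auxiliary features in the same space.
noisy_target = np.array([1.0, 0.0])
aux_features = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
fused, weights = amend(noisy_target, aux_features)
```

In this sketch the auxiliary feature that is more similar to the target receives the larger weight, so a weakly correlated modality contributes less to the amended feature. In a learned system the similarity measure and the fusion rule would typically be trained rather than fixed.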