With the continuous growth of multimodal data on the Internet, cross-modal retrieval has attracted increasing attention. Most existing methods map multimodal data into a common representation space in which semantically similar samples lie close to one another. However, these methods do not fully exploit the similarities among multi-labels or between multi-labels and samples. Moreover, maintaining semantic consistency between the common representations of different modalities remains a key challenge. To address these issues, this paper proposes a Multi-label Guided Graph Similarity Learning (MGGSL) method. MGGSL constructs a Multi-label (ML) graph from the similarities among multi-labels in the dataset and extracts multi-label embeddings with a graph convolutional network (GCN) to guide the learning of the common representations of different modalities. Additionally, we utilize the similarity between multi-labels and samples to construct a Visual Semantic (VS) graph and a Textual Semantic (TS) graph, and propose a graph similarity learning approach that enforces the semantic consistency of cross-modal features from the perspectives of node similarity, adjacency-matrix similarity, edge similarity, and degree similarity. Experiments on three widely used datasets, NUS-WIDE, MIRFlickr-25K, and MS-COCO, demonstrate that MGGSL outperforms several existing state-of-the-art methods.
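To make the label-graph idea concrete, the following is a minimal sketch, not the paper's implementation: it assumes the ML graph is built from label co-occurrence statistics and that label embeddings are produced by a standard two-layer GCN. All function names, dimensions, and hyperparameters below are illustrative assumptions.

```python
# Minimal sketch (assumptions, not MGGSL's exact construction): build a
# symmetrically normalized label adjacency from co-occurrence and run a
# two-layer GCN to obtain one embedding per label.
import torch
import torch.nn as nn
import torch.nn.functional as F


def label_cooccurrence_adjacency(label_matrix: torch.Tensor) -> torch.Tensor:
    """label_matrix: (num_samples, num_labels) binary multi-label annotations.
    Returns D^{-1/2} (A + I) D^{-1/2}, the normalized adjacency used by a
    standard GCN."""
    cooc = label_matrix.t() @ label_matrix          # (L, L) co-occurrence counts
    adj = (cooc > 0).float()                        # connect labels that ever co-occur
    adj = adj + torch.eye(adj.size(0))              # add self-loops
    deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    d = torch.diag(deg_inv_sqrt)
    return d @ adj @ d


class LabelGCN(nn.Module):
    """Two-layer GCN producing multi-label embeddings."""
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.fc2 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        x = F.relu(adj @ self.fc1(x))               # propagate + transform
        return adj @ self.fc2(x)                    # (num_labels, out_dim)


# Toy usage: random annotations and random initial label features stand in for
# real dataset labels and label-word embeddings.
labels = (torch.rand(1000, 24) > 0.9).float()       # 1000 samples, 24 labels
adj = label_cooccurrence_adjacency(labels)
gcn = LabelGCN(in_dim=300, hid_dim=512, out_dim=1024)
label_emb = gcn(torch.randn(24, 300), adj)          # 24 label embeddings
```

In this sketch, the resulting label embeddings would play the guiding role described above, while the VS/TS graphs and the node/adjacency/edge/degree similarity terms are left to the method section, since the abstract does not specify their exact form.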