Real-time matching between customer demands and product information via text-image retrieval remains a fundamental problem in intelligent retailing. However, this process involves challenges spanning data quality, multi-modal retrieval strategies, and retrieval efficiency. To address these challenges, we propose a cross-modal retrieval pipeline that leverages a contrastive loss and a novel sampling strategy. We formulate text-image retrieval as a two-stage process comprising unsupervised clustering and contrastive feature representation. Additionally, we construct an image-caption matching dataset by expanding the Grocery Store Dataset with a foundation vision-language model. Our experiments demonstrate the effectiveness of our method on both the expanded dataset and the well-known cross-modal retrieval benchmark, Flickr30k.
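To make the contrastive objective concrete, the following is a minimal NumPy sketch of a symmetric InfoNCE-style loss over paired image and text embeddings, the standard formulation for cross-modal contrastive learning. The function name, the temperature value, and the use of in-batch negatives are illustrative assumptions and do not reproduce the paper's actual implementation or sampling strategy.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (illustrative sketch, not the paper's code).

    img_emb, txt_emb: arrays of shape (batch, dim); row i of each is a
    matched image-caption pair, and all other rows act as in-batch negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by an assumed temperature.
    logits = img @ txt.T / temperature

    def cross_entropy_diag(l):
        # Softmax cross-entropy with the matched pair (diagonal) as target.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(l)
        p /= p.sum(axis=1, keepdims=True)
        return -np.log(np.diag(p)).mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

With perfectly aligned pairs (identical embeddings) the loss approaches zero, while mismatched pairs yield a larger value; training pushes embeddings of matched image-caption pairs together and unmatched ones apart.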