A Hard Negatives Mining and Enhancing Method for Multi-Modal Contrastive Learning

Times Cited: 0
Authors
Li, Guangping [1 ]
Gao, Yanan [1 ]
Huang, Xianhui [1 ]
Ling, Bingo Wing-Kuen [1 ]
Affiliations
[1] Guangdong Univ Technol, Sch Informat Engn, Guangzhou 510000, Peoples R China
Source
ELECTRONICS | 2025, Vol. 14, Issue 4
Keywords
hard negatives; contrastive learning; multi-modal
DOI
10.3390/electronics14040767
Chinese Library Classification (CLC): TP [Automation Technology, Computer Technology]
Discipline Classification Code: 0812
Abstract
Contrastive learning has emerged as a dominant paradigm for understanding 3D open-world environments, particularly in multi-modal settings. However, owing to the nature of self-supervised learning and the limited size of 3D datasets, pre-trained models in the 3D point cloud domain often overfit in downstream tasks, especially zero-shot classification. To tackle this problem, we design a module that mines and enhances hard negatives from datasets, which helps improve the discriminative power of models. The module integrates seamlessly into cross-modal contrastive learning frameworks, mitigating overfitting by enhancing the mined hard negatives during training. It consists of two key components: mining and enhancing. During mining, we identify hard negative samples by examining similarity relationships between the vision-vision and vision-text modalities, locating hard negative pairs within the visual domain. During enhancing, we compute weighting coefficients from the similarity differences of the mined hard negatives. By enhancing the mined hard negatives while leaving the others unchanged, we improve the overall performance and discrimination of the model. A series of experiments demonstrates that our module can be easily incorporated into various contrastive learning frameworks, improving model performance in both zero-shot and few-shot tasks.
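The mining-and-enhancing idea described in the abstract can be sketched in code. This is a minimal illustrative NumPy sketch, not the paper's exact formulation: the function name `weighted_infonce`, the use of the clipped vision-vision minus vision-text similarity gap as the mining signal, and the linear weighting coefficient `beta` are all assumptions made for illustration.

```python
import numpy as np

def weighted_infonce(pc_emb, img_emb, txt_emb, tau=0.07, beta=0.5):
    """Illustrative sketch: InfoNCE-style loss where hard negatives, mined
    from the disagreement between vision-vision and vision-text similarity,
    are up-weighted while all other pairs are left unchanged.

    pc_emb, img_emb, txt_emb: (N, D) embeddings of matched triplets
    (point cloud, image, text); row i of each array describes the same object.
    """
    # L2-normalise rows so that dot products are cosine similarities.
    norm = lambda e: e / np.linalg.norm(e, axis=1, keepdims=True)
    pc, img, txt = norm(pc_emb), norm(img_emb), norm(txt_emb)

    sim_vv = pc @ img.T   # vision-vision similarities, shape (N, N)
    sim_vt = pc @ txt.T   # vision-text similarities, shape (N, N)

    n = sim_vv.shape[0]
    off_diag = ~np.eye(n, dtype=bool)

    # Mining (assumed criterion): an off-diagonal pair is "hard" when it
    # looks similar in the visual space but the text modality disagrees,
    # i.e. the vision-vision similarity exceeds the vision-text one.
    gap = np.clip(sim_vv - sim_vt, 0.0, None) * off_diag

    # Enhancing (assumed form): weights grow linearly with the similarity
    # gap; positives (diagonal) and easy negatives keep weight 1.
    weights = 1.0 + beta * gap
    logits = sim_vv * weights / tau

    # Cross-entropy against the diagonal (matched-pair) targets,
    # computed with a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Setting `beta=0` recovers a plain symmetric-free InfoNCE term, so the weighting acts purely as a reweighting of the mined hard negatives on top of the base contrastive objective.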
Pages: 15