Dense Semantic Contrast for Self-Supervised Visual Representation Learning

Cited by: 21
Authors
Li, Xiaoni [1, 2]
Zhou, Yu [1, 2]
Zhang, Yifei [1, 2]
Zhang, Aoting [1]
Wang, Wei [1, 2]
Jiang, Ning [3]
Wu, Haiying [3]
Wang, Weiping [1]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
[3] Mashang Consumer Finance Co Ltd, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021
Funding
National Natural Science Foundation of China
Keywords
Self-Supervised Learning; Representation Learning; Contrastive Learning; Dense Representation; Semantics Discovery
DOI
10.1145/3474085.3475551
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Self-supervised representation learning for visual pre-training has achieved remarkable success through sample (instance or pixel) discrimination and instance-level semantics discovery, yet a non-negligible gap remains between pre-trained models and downstream dense prediction tasks. Concretely, these downstream tasks require more accurate representations: pixels from the same object must belong to a shared semantic category, a property that previous methods lack. In this work, we present Dense Semantic Contrast (DSC), which models semantic category decision boundaries at a dense level to meet the requirements of these tasks. Furthermore, we propose a dense cross-image semantic contrastive learning framework for multi-granularity representation learning. Specifically, we explicitly explore the semantic structure of the dataset by mining relations among pixels from different perspectives. For intra-image relation modeling, we discover pixel neighbors from multiple views. For inter-image relations, we enforce pixel representations from the same semantic class to be more similar than representations from different classes within one mini-batch. Experimental results show that our DSC model outperforms state-of-the-art methods when transferring to downstream dense prediction tasks, including object detection, semantic segmentation, and instance segmentation. Code will be made available.
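The inter-image objective described in the abstract can be illustrated with a small sketch. The abstract does not give the loss formula, so what follows is an assumption: a supervised-contrastive-style loss over pixel embeddings pooled from a mini-batch, where pixels sharing a pseudo semantic label are treated as positives. The function name `dense_semantic_contrast_loss`, the `temperature` value, and the idea that labels come from some clustering step are all hypothetical and are not taken from the paper.

```python
import math


def dense_semantic_contrast_loss(features, labels, temperature=0.2):
    """Illustrative cross-image pixel contrastive loss (NOT the paper's exact loss).

    features: L2-normalized pixel embeddings (lists of floats), pooled from
              every image in the mini-batch.
    labels:   one pseudo semantic class id per pixel (assumed to come from
              some clustering step; hypothetical).
    """
    n = len(features)
    total, counted = 0.0, 0
    for i in range(n):
        # Scaled dot-product similarity between anchor i and every other pixel.
        logits = {
            j: sum(a * b for a, b in zip(features[i], features[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Positives are the other pixels that share the anchor's class.
        positives = [j for j in logits if labels[j] == labels[i]]
        if not positives:
            continue  # no same-class pixel in the batch for this anchor
        # Numerically stable log-sum-exp over all non-anchor pixels.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability of the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Four pixel embeddings forming two tight groups.
pixels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = dense_semantic_contrast_loss(pixels, [0, 0, 1, 1])
misaligned = dense_semantic_contrast_loss(pixels, [0, 1, 0, 1])
print(aligned < misaligned)
```

The loss is low when same-class pixels already lie close together and high when the labels cut across the embedding groups, which is the behavior the abstract asks of the inter-image term.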
Pages: 1368-1376
Page count: 9