Dense Semantic Contrast for Self-Supervised Visual Representation Learning

Cited by: 23

Authors:

Li, Xiaoni [1,2]

Zhou, Yu [1,2]

Zhang, Yifei [1,2]

Zhang, Aoting [1]

Wang, Wei [1,2]

Jiang, Ning [3]

Wu, Haiying [3]

Wang, Weiping [1]

Affiliations:

[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China

[3] Mashang Consumer Finance Co Ltd, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021

Funding:

National Natural Science Foundation of China;

Keywords:

Self-Supervised Learning; Representation Learning; Contrastive Learning; Dense Representation; Semantics Discovery;

DOI:

10.1145/3474085.3475551

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Self-supervised representation learning for visual pre-training has achieved remarkable success through sample (instance or pixel) discrimination and instance-level semantics discovery, yet a non-negligible gap remains between pre-trained models and downstream dense prediction tasks. Concretely, these downstream tasks require more accurate representations: pixels from the same object must belong to a shared semantic category, a property that previous methods lack. In this work, we present Dense Semantic Contrast (DSC), which models semantic category decision boundaries at a dense level to meet the requirements of these tasks. Furthermore, we propose a dense cross-image semantic contrastive learning framework for multi-granularity representation learning. Specifically, we explicitly explore the semantic structure of the dataset by mining relations among pixels from different perspectives. For intra-image relation modeling, we discover pixel neighbors from multiple views. For inter-image relations, we enforce pixel representations from the same semantic class to be more similar than representations from different classes within one mini-batch. Experimental results show that our DSC model outperforms state-of-the-art methods when transferring to downstream dense prediction tasks, including object detection, semantic segmentation, and instance segmentation. Code will be made available.
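The inter-image objective described in the abstract (pixel representations of the same semantic class should be more similar than those of different classes within a mini-batch) can be illustrated with a generic supervised-contrastive-style loss over pixel embeddings. The following is only a minimal sketch, not the authors' implementation: the function name `dense_semantic_contrast_loss`, the `temperature` value, and the assumption that each pixel already carries an integer pseudo-class label (for example from clustering) are all hypothetical choices made for illustration.

```python
import numpy as np


def dense_semantic_contrast_loss(feats, labels, temperature=0.2):
    """Illustrative pixel-level contrastive loss (hypothetical, not the paper's code).

    feats:  (N, D) non-zero pixel embeddings pooled from all images in a mini-batch.
    labels: (N,) integer pseudo-class id per pixel (assumed to be given).
    Returns the mean loss over pixels that have at least one same-class partner.
    """
    feats = np.asarray(feats, dtype=np.float64)
    labels = np.asarray(labels)
    n = labels.shape[0]

    # L2-normalise so that dot products are cosine similarities.
    z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # A pixel is never compared with itself.
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)

    # Numerically stable log-softmax over all other pixels in the batch.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positives: other pixels sharing the same pseudo-class label.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0

    # Average log-probability assigned to the positives of each valid pixel.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


if __name__ == "__main__":
    # Four toy pixel embeddings forming two tight groups.
    feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    print(dense_semantic_contrast_loss(feats, [0, 0, 1, 1]))  # labels agree with geometry
    print(dense_semantic_contrast_loss(feats, [0, 1, 0, 1]))  # labels contradict geometry
```

In this toy example the loss is lower when the labels agree with the geometry of the embeddings than when they contradict it, which is the behaviour such an objective is meant to reward.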


Pages: 1368-1376

Page count: 9
