Dense Semantic Contrast for Self-Supervised Visual Representation Learning

Cited by: 23

Authors:

Li, Xiaoni [1,2]

Zhou, Yu [1,2]

Zhang, Yifei [1,2]

Zhang, Aoting [1]

Wang, Wei [1,2]

Jiang, Ning [3]

Wu, Haiying [3]

Wang, Weiping [1]

Affiliations:

[1] Chinese Academy of Sciences, Institute of Information Engineering, Beijing, People's Republic of China

[2] University of Chinese Academy of Sciences, School of Cyber Security, Beijing, People's Republic of China

[3] Mashang Consumer Finance Co Ltd, Beijing, People's Republic of China

Source:

PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021

Funding:

National Natural Science Foundation of China;

Keywords:

Self-Supervised Learning; Representation Learning; Contrastive Learning; Dense Representation; Semantics Discovery;

DOI:

10.1145/3474085.3475551

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Self-supervised representation learning for visual pre-training has achieved remarkable success with sample (instance or pixel) discrimination and instance-level semantics discovery, yet a non-negligible gap remains between pre-trained models and downstream dense prediction tasks. Concretely, these downstream tasks require more accurate representations; in other words, pixels from the same object must belong to a shared semantic category, a property that previous methods lack. In this work, we present Dense Semantic Contrast (DSC), which models semantic category decision boundaries at a dense level to meet the requirements of these tasks. Furthermore, we propose a dense cross-image semantic contrastive learning framework for multi-granularity representation learning. Specifically, we explicitly explore the semantic structure of the dataset by mining relations among pixels from different perspectives. For intra-image relation modeling, we discover pixel neighbors from multiple views. For inter-image relations, we enforce pixel representations from the same semantic class to be more similar than representations from different classes within one mini-batch. Experimental results show that our DSC model outperforms state-of-the-art methods when transferring to downstream dense prediction tasks, including object detection, semantic segmentation, and instance segmentation. Code will be made available.
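The inter-image objective described in the abstract (pixel representations of the same semantic class pulled together, different classes pushed apart within one mini-batch) can be illustrated with a small sketch. This is not the authors' implementation: it is a generic supervised-contrastive-style loss applied to pixel embeddings, and the function name `dense_semantic_contrast_loss`, the `temperature` value, and the assumption that each pixel already carries an integer pseudo-class label (for example from clustering) are all hypothetical choices made for illustration.

```python
import numpy as np


def dense_semantic_contrast_loss(features, labels, temperature=0.1):
    """Illustrative pixel-level contrastive loss (hypothetical, not the paper's code).

    features: (N, D) array of non-zero pixel embeddings pooled from every
              image in one mini-batch.
    labels:   (N,) integer pseudo-class ids, assumed to be given.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = features.shape[0]
    if n < 2:
        raise ValueError("need at least two pixel embeddings")

    # L2-normalize so that dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = (z @ z.T) / temperature

    # A pixel is never compared with itself.
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        raise ValueError("no pixel has a same-class partner in the batch")

    # Numerically stable log-softmax over all other pixels in the batch.
    logits = np.where(not_self, logits, -np.inf)
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom

    # Average log-probability assigned to same-class pixels, per anchor.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())
```

Under these assumptions, the loss is small when pixels sharing a label have similar embeddings and large when same-label pixels lie far apart, which is the behaviour the abstract describes for inter-image relations.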


Pages: 1368-1376

Page count: 9

