From Pixels to Semantics: Self-Supervised Video Object Segmentation With Multiperspective Feature Mining

Cited by: 5
Authors
Li, Ruoqi [1 ]
Wang, Yifan [2 ,3 ]
Wang, Lijun [3 ,4 ]
Lu, Huchuan [1 ,3 ]
Wei, Xiaopeng [5 ]
Zhang, Qiang [5 ]
Affiliations
[1] Dalian Univ Technol, Sch Informat & Commun Engn, Dalian 116024, Peoples R China
[2] Dalian Univ Technol, Sch Innovat & Entrepreneurship, Dalian 116024, Peoples R China
[3] Dalian Univ Technol, Ningbo Inst, Ningbo 315012, Peoples R China
[4] Dalian Univ Technol, Sch Artificial Intelligence, Dalian 116024, Peoples R China
[5] Dalian Univ Technol, Sch Comp Sci & Technol, Dalian 116024, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Training; Feature extraction; Image reconstruction; Task analysis; Object segmentation; Image segmentation; Video object segmentation; self-supervised learning; pixel-level correspondence; semantic-level adaption; feature mining;
DOI
10.1109/TIP.2022.3201603
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Existing self-supervised methods pose one-shot video object segmentation (O-VOS) as pixel-level matching to enable segmentation mask propagation across frames. However, the two tasks are not fully equivalent, since O-VOS relies more on semantic correspondence than on accurate pixel matching. To remedy this issue, we explore a new self-supervised framework that integrates pixel-level correspondence learning with semantic-level adaptation. Pixel-level correspondence learning is performed through photometric reconstruction of adjacent RGB frames during offline training, while semantic-level adaptation operates at test time by enforcing bi-directional agreement between the predicted segmentation masks. In addition, we propose a new network architecture with a multi-perspective feature-mining mechanism that not only enhances reliable features but also suppresses noisy ones, facilitating more robust image matching. By training the network with the proposed self-supervised framework, we achieve state-of-the-art performance on widely adopted datasets, further closing the gap between self-supervised learning methods and their fully supervised counterparts.
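To make the pixel-level correspondence objective concrete, below is a minimal sketch of affinity-based photometric reconstruction between adjacent frames, the standard formulation used by self-supervised correspondence methods. All names, shapes, and the temperature value are illustrative assumptions, not the authors' released implementation, and the feature extractor is left abstract.

```python
# Sketch of a photometric reconstruction loss for correspondence learning.
# Assumes features and RGB frames are already at the same spatial resolution.
import torch
import torch.nn.functional as F

def reconstruction_loss(feat_ref, feat_tgt, rgb_ref, rgb_tgt, temperature=0.07):
    """Reconstruct the target frame from the reference frame through a soft
    pixel-to-pixel affinity computed in feature space.

    feat_ref, feat_tgt: (B, C, H, W) features of two adjacent frames.
    rgb_ref, rgb_tgt:   (B, 3, H, W) RGB frames (downsampled to feature size).
    """
    B, C, H, W = feat_ref.shape
    # Flatten spatial dimensions and L2-normalize the features.
    f_ref = F.normalize(feat_ref.flatten(2), dim=1)         # (B, C, HW)
    f_tgt = F.normalize(feat_tgt.flatten(2), dim=1)         # (B, C, HW)
    # Affinity from every target pixel to every reference pixel.
    affinity = torch.einsum('bci,bcj->bij', f_tgt, f_ref)   # (B, HW, HW)
    affinity = F.softmax(affinity / temperature, dim=-1)
    # Copy reference colors through the affinity to rebuild the target frame.
    colors_ref = rgb_ref.flatten(2)                          # (B, 3, HW)
    rgb_rec = torch.einsum('bij,bcj->bci', affinity, colors_ref)
    rgb_rec = rgb_rec.view(B, 3, H, W)
    # The photometric (L1) error drives the features toward good matching.
    return F.l1_loss(rgb_rec, rgb_tgt)
```

At test time, the same affinity can propagate the first-frame mask instead of colors; the paper's semantic-level adaptation then refines the features by enforcing forward-backward agreement of the propagated masks.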
Pages: 5801-5812
Page count: 12