From Pixels to Semantics: Self-Supervised Video Object Segmentation With Multiperspective Feature Mining

Cited by: 5
Authors
Li, Ruoqi [1 ]
Wang, Yifan [2 ,3 ]
Wang, Lijun [3 ,4 ]
Lu, Huchuan [1 ,3 ]
Wei, Xiaopeng [5 ]
Zhang, Qiang [5 ]
Affiliations
[1] Dalian Univ Technol, Sch Informat & Commun Engn, Dalian 116024, Peoples R China
[2] Dalian Univ Technol, Sch Innovat & Entrepreneurship, Dalian 116024, Peoples R China
[3] Dalian Univ Technol, Ningbo Inst, Ningbo 315012, Peoples R China
[4] Dalian Univ Technol, Sch Artificial Intelligence, Dalian 116024, Peoples R China
[5] Dalian Univ Technol, Sch Comp Sci & Technol, Dalian 116024, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Training; Feature extraction; Image reconstruction; Task analysis; Object segmentation; Image segmentation; Video object segmentation; self-supervised learning; pixel-level correspondence; semantic-level adaption; feature mining;
DOI
10.1109/TIP.2022.3201603
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Existing self-supervised methods pose one-shot video object segmentation (O-VOS) as pixel-level matching to enable segmentation mask propagation across frames. However, the two tasks are not fully equivalent, since O-VOS relies more on semantic correspondence than on accurate pixel matching. To remedy this issue, we explore a new self-supervised framework that integrates pixel-level correspondence learning with semantic-level adaptation. Pixel-level correspondence learning is performed through photometric reconstruction of adjacent RGB frames during offline training, while semantic-level adaptation operates at test time by enforcing bi-directional agreement between the predicted segmentation masks. In addition, we propose a new network architecture with a multi-perspective feature-mining mechanism that not only enhances reliable features but also suppresses noisy ones, facilitating more robust image matching. By training the network with the proposed self-supervised framework, we achieve state-of-the-art performance on widely adopted datasets, further closing the gap between self-supervised learning methods and their fully supervised counterparts.
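To make the pixel-level correspondence objective concrete, below is a minimal sketch of affinity-based photometric reconstruction between adjacent frames, the standard formulation used by self-supervised correspondence methods. All names, shapes, and the temperature value are illustrative assumptions, not the authors' released implementation, and the feature extractor is left abstract.

```python
# Sketch of a photometric reconstruction loss for correspondence learning.
# Assumes features and RGB frames are already at the same spatial resolution.
import torch
import torch.nn.functional as F

def reconstruction_loss(feat_ref, feat_tgt, rgb_ref, rgb_tgt, temperature=0.07):
    """Reconstruct the target frame from the reference frame through a soft
    pixel-to-pixel affinity computed in feature space.

    feat_ref, feat_tgt: (B, C, H, W) features of two adjacent frames.
    rgb_ref, rgb_tgt:   (B, 3, H, W) RGB frames (downsampled to feature size).
    """
    B, C, H, W = feat_ref.shape
    # Flatten spatial dimensions and L2-normalize the features.
    f_ref = F.normalize(feat_ref.flatten(2), dim=1)         # (B, C, HW)
    f_tgt = F.normalize(feat_tgt.flatten(2), dim=1)         # (B, C, HW)
    # Affinity from every target pixel to every reference pixel.
    affinity = torch.einsum('bci,bcj->bij', f_tgt, f_ref)   # (B, HW, HW)
    affinity = F.softmax(affinity / temperature, dim=-1)
    # Copy reference colors through the affinity to rebuild the target frame.
    colors_ref = rgb_ref.flatten(2)                          # (B, 3, HW)
    rgb_rec = torch.einsum('bij,bcj->bci', affinity, colors_ref)
    rgb_rec = rgb_rec.view(B, 3, H, W)
    # The photometric (L1) error drives the features toward good matching.
    return F.l1_loss(rgb_rec, rgb_tgt)
```

At test time, the same affinity can propagate the first-frame mask instead of colors; the paper's semantic-level adaptation then refines the features by enforcing forward-backward agreement of the propagated masks.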
Pages: 5801-5812
Page count: 12