In-N-Out Generative Learning for Dense Unsupervised Video Segmentation

Cited by: 6
Authors
Pan, Xiao [1 ,2 ]
Li, Peike [2 ,3 ]
Yang, Zongxin [1 ]
Zhou, Huiling [2 ]
Zhou, Chang [2 ]
Yang, Hongxia [2 ]
Zhou, Jingren [2 ]
Yang, Yi [1 ]
Affiliations
[1] Zhejiang Univ, ReLER Lab, CCAI, Hangzhou, Peoples R China
[2] Alibaba DAMO Acad, Hangzhou, Peoples R China
[3] Univ Technol Sydney, Sydney, NSW, Australia
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Keywords
unsupervised video object segmentation; self-supervised learning; dense prediction; generative learning;
DOI
10.1145/3503161.3547909
CLC number
TP39 [Computer Applications];
Discipline codes
081203 ; 0835 ;
Abstract
In this paper, we focus on unsupervised learning for Video Object Segmentation (VOS), which learns visual correspondence (i.e., the similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm, which optimizes at either the image level or the pixel level. Image-level optimization (e.g., on the spatially pooled feature of ResNet) learns robust high-level semantics but is sub-optimal, since pixel-level features are only optimized implicitly. By contrast, pixel-level optimization is more explicit; however, it is sensitive to the visual quality of the training data and is not robust to object deformation. To perform these two complementary levels of optimization in a unified framework, we propose In-aNd-Out (INO) generative learning from a purely generative perspective, with the help of the naturally designed class tokens and patch tokens in the Vision Transformer (ViT). Specifically, for image-level optimization, we force out-view imagination from local to global views on class tokens, which helps capture high-level semantics; we name this out-generative learning. For pixel-level optimization, we perform in-view masked image modeling on patch tokens, which recovers the corrupted parts of an image by inferring its fine-grained structure; we term this in-generative learning. To better exploit temporal information, we additionally enforce inter-frame consistency at both the feature and the affinity-matrix level. Extensive experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that INO outperforms previous state-of-the-art methods by significant margins.
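The abstract builds on correspondence-based VOS, where the affinity matrix between pixel-level features of two frames is used to propagate segmentation labels from a reference frame to a target frame. Below is a minimal NumPy sketch of that inference step under the usual formulation (cosine similarity plus a softmax over reference pixels); the function names, temperature value, and toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def propagate_labels(feat_ref, feat_tgt, labels_ref, temperature=0.07):
    """Propagate per-pixel labels from a reference frame to a target frame
    via a feature affinity matrix (cosine similarity), the standard
    inference step in correspondence-based VOS."""
    # L2-normalize so the dot product is cosine similarity
    fr = feat_ref / np.linalg.norm(feat_ref, axis=1, keepdims=True)
    ft = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    affinity = softmax(ft @ fr.T / temperature, axis=1)  # (N_tgt, N_ref)
    # labels_ref: one-hot (N_ref, K); weighted vote over reference pixels
    return affinity @ labels_ref  # soft labels, shape (N_tgt, K)

# Toy data: the "target" frame is a slightly perturbed copy of the reference,
# so each target pixel should inherit the label of its reference counterpart.
rng = np.random.default_rng(0)
feat_ref = rng.normal(size=(16, 32))                   # 16 pixels, 32-dim features
feat_tgt = feat_ref + 0.01 * rng.normal(size=(16, 32))
labels_ref = np.eye(2)[rng.integers(0, 2, size=16)]    # one-hot, 2 classes
soft = propagate_labels(feat_ref, feat_tgt, labels_ref)
pred = soft.argmax(axis=1)
```

The low temperature sharpens the affinity so each target pixel is dominated by its best-matching reference pixel; a higher temperature would blend labels from many reference locations.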
Pages: 1819-1827
Page count: 9