In-N-Out Generative Learning for Dense Unsupervised Video Segmentation

Cited by: 6
Authors
Pan, Xiao [1 ,2 ]
Li, Peike [2 ,3 ]
Yang, Zongxin [1 ]
Zhou, Huiling [2 ]
Zhou, Chang [2 ]
Yang, Hongxia [2 ]
Zhou, Jingren [2 ]
Yang, Yi [1 ]
Affiliations
[1] Zhejiang Univ, ReLER Lab, CCAI, Hangzhou, Peoples R China
[2] Alibaba DAMO Acad, Hangzhou, Peoples R China
[3] Univ Technol Sydney, Sydney, NSW, Australia
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Keywords
unsupervised video object segmentation; self-supervised learning; dense prediction; generative learning;
DOI
10.1145/3503161.3547909
CLC number
TP39 [Computer Applications];
Discipline codes
081203 ; 0835 ;
Abstract
In this paper, we focus on unsupervised learning for Video Object Segmentation (VOS), which learns visual correspondence (i.e., the similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm, which optimizes at either the image level or the pixel level. Image-level optimization (e.g., on the spatially pooled feature of ResNet) learns robust high-level semantics but is sub-optimal, since pixel-level features are only optimized implicitly. By contrast, pixel-level optimization is more explicit; however, it is sensitive to the visual quality of the training data and is not robust to object deformation. To perform these two complementary levels of optimization in a unified framework, we propose In-aNd-Out (INO) generative learning from a purely generative perspective, with the help of the naturally designed class tokens and patch tokens in the Vision Transformer (ViT). Specifically, for image-level optimization, we force out-view imagination from local to global views on class tokens, which helps capture high-level semantics; we name this out-generative learning. For pixel-level optimization, we perform in-view masked image modeling on patch tokens, which recovers the corrupted parts of an image by inferring its fine-grained structure; we term this in-generative learning. To better exploit temporal information, we additionally enforce inter-frame consistency at both the feature and the affinity-matrix level. Extensive experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that INO outperforms previous state-of-the-art methods by significant margins.
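The abstract builds on correspondence-based VOS, where the affinity matrix between pixel-level features of two frames is used to propagate segmentation labels from a reference frame to a target frame. Below is a minimal NumPy sketch of that inference step under the usual formulation (cosine similarity plus a softmax over reference pixels); the function names, temperature value, and toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def propagate_labels(feat_ref, feat_tgt, labels_ref, temperature=0.07):
    """Propagate per-pixel labels from a reference frame to a target frame
    via a feature affinity matrix (cosine similarity), the standard
    inference step in correspondence-based VOS."""
    # L2-normalize so the dot product is cosine similarity
    fr = feat_ref / np.linalg.norm(feat_ref, axis=1, keepdims=True)
    ft = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    affinity = softmax(ft @ fr.T / temperature, axis=1)  # (N_tgt, N_ref)
    # labels_ref: one-hot (N_ref, K); weighted vote over reference pixels
    return affinity @ labels_ref  # soft labels, shape (N_tgt, K)

# Toy data: the "target" frame is a slightly perturbed copy of the reference,
# so each target pixel should inherit the label of its reference counterpart.
rng = np.random.default_rng(0)
feat_ref = rng.normal(size=(16, 32))                   # 16 pixels, 32-dim features
feat_tgt = feat_ref + 0.01 * rng.normal(size=(16, 32))
labels_ref = np.eye(2)[rng.integers(0, 2, size=16)]    # one-hot, 2 classes
soft = propagate_labels(feat_ref, feat_tgt, labels_ref)
pred = soft.argmax(axis=1)
```

The low temperature sharpens the affinity so each target pixel is dominated by its best-matching reference pixel; a higher temperature would blend labels from many reference locations.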
Pages: 1819-1827
Page count: 9