In-N-Out Generative Learning for Dense Unsupervised Video Segmentation

Cited by: 6
Authors
Pan, Xiao [1 ,2 ]
Li, Peike [2 ,3 ]
Yang, Zongxin [1 ]
Zhou, Huiling [2 ]
Zhou, Chang [2 ]
Yang, Hongxia [2 ]
Zhou, Jingren [2 ]
Yang, Yi [1 ]
Affiliations
[1] Zhejiang Univ, ReLER Lab, CCAI, Hangzhou, Peoples R China
[2] Alibaba DAMO Acad, Hangzhou, Peoples R China
[3] Univ Technol Sydney, Sydney, NSW, Australia
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Keywords
unsupervised video object segmentation; self-supervised learning; dense prediction; generative learning;
DOI
10.1145/3503161.3547909
CLC number
TP39 [Computer Applications];
Subject classification codes
081203; 0835
Abstract
In this paper, we focus on unsupervised learning for Video Object Segmentation (VOS), which learns visual correspondence (i.e., the similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm and optimize at either the image level or the pixel level. Image-level optimization (e.g., on the spatially pooled features of a ResNet) learns robust high-level semantics but is sub-optimal because pixel-level features are only optimized implicitly. By contrast, pixel-level optimization is more explicit, but it is sensitive to the visual quality of the training data and is not robust to object deformation. To perform these two levels of optimization complementarily in a unified framework, we propose In-aNd-Out (INO) generative learning, which takes a purely generative perspective and exploits the naturally designed class tokens and patch tokens of the Vision Transformer (ViT). Specifically, for image-level optimization, we force out-view imagination from local to global views on class tokens, which helps capture high-level semantics; we term this out-generative learning. For pixel-level optimization, we perform in-view masked image modeling on patch tokens, which recovers the corrupted parts of an image by inferring its fine-grained structure; we term this in-generative learning. To better exploit temporal information, we additionally enforce inter-frame consistency at both the feature and affinity-matrix levels. Extensive experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that INO outperforms previous state-of-the-art methods by significant margins.
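To make the inter-frame consistency idea above concrete, the following is a minimal, hedged sketch (not the authors' released implementation): it computes a soft cross-frame affinity between ViT patch-token features and penalizes (1) a feature-level cycle reconstruction and (2) the deviation of the forward/backward round-trip affinity from the identity. The function names `affinity` and `interframe_consistency_loss`, the `temperature` value, and the exact loss forms are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def affinity(feat_a: torch.Tensor, feat_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Soft correspondence from frame A to frame B.

    feat_a, feat_b: (B, N, C) patch-token features from a ViT backbone (assumed shapes).
    Returns a row-stochastic (B, N, N) affinity matrix.
    """
    feat_a = F.normalize(feat_a, dim=-1)
    feat_b = F.normalize(feat_b, dim=-1)
    logits = torch.bmm(feat_a, feat_b.transpose(1, 2)) / temperature  # cosine similarities
    return logits.softmax(dim=-1)


def interframe_consistency_loss(feat_t: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
    """Illustrative consistency terms between frames t and t+1 (assumed formulation):
    (1) feature level  - warping frame-t features to t+1 and back should reconstruct them;
    (2) affinity level - the forward/backward round trip should be close to the identity."""
    a_fwd = affinity(feat_t, feat_t1)   # t   -> t+1
    a_bwd = affinity(feat_t1, feat_t)   # t+1 -> t
    # Feature-level cycle reconstruction: pull frame-t features to t+1 positions, then back.
    recon_t = torch.bmm(a_fwd, torch.bmm(a_bwd, feat_t))
    feat_loss = F.mse_loss(recon_t, feat_t)
    # Affinity-level consistency: round-trip affinity should approximate the identity matrix.
    round_trip = torch.bmm(a_fwd, a_bwd)                              # (B, N, N)
    eye = torch.eye(round_trip.size(-1), device=round_trip.device).expand_as(round_trip)
    aff_loss = F.mse_loss(round_trip, eye)
    return feat_loss + aff_loss
```

In the full INO framework, such temporal terms would be combined with the out-generative objective on class tokens (local-to-global view prediction) and the in-generative objective on patch tokens (masked image modeling) described in the abstract.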
Pages: 1819 - 1827
Number of pages: 9