Contrastive Masked Autoencoders for Self-Supervised Video Hashing

Cited: 0
Authors
Wang, Yuting [1 ,3 ]
Wang, Jinpeng [1 ,3 ]
Chen, Bin [2 ]
Zeng, Ziyun [1 ,3 ]
Xia, Shu-Tao [1 ,3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Beijing, Peoples R China
[2] Harbin Inst Technol, Shenzhen, Peoples R China
[3] Peng Cheng Lab, Res Ctr Artificial Intelligence, Shenzhen, Peoples R China
Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3 | 2023
Funding
National Natural Science Foundation of China;
Keywords
DOI
(none listed)
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision, improving the efficiency of large-scale video retrieval and attracting increasing research attention. The success of SSVH lies in understanding video content and capturing the semantic relations among unlabeled videos. Typically, state-of-the-art SSVH methods address these two points with a two-stage training pipeline: they first train an auxiliary network on instance-wise mask-and-predict tasks and then train a hashing model to preserve the pseudo-neighborhood structure transferred from the auxiliary network. This consecutive training strategy is inflexible and also unnecessary. In this paper, we propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding in a single stage. To capture video semantic information, we adopt an encoder-decoder structure to reconstruct the video from its temporally masked frames. In particular, we find that a higher masking ratio helps video understanding. In addition, we fully exploit the similarity relationships between videos by maximizing agreement between two augmented views of a video, which yields more discriminative and robust hash codes. Extensive experiments on three large-scale video datasets (i.e., FCVID, ActivityNet, and YFCC) show that ConMH achieves state-of-the-art results. Code is available at https://github.com/huangmozhi9527/ConMH.
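The abstract describes two ingredients trained jointly: reconstructing a video from a small fraction of visible frames (high temporal masking ratio) and maximizing agreement between two augmented views of the same video. As a rough illustration of those two mechanics, here is a minimal NumPy sketch of random temporal masking and a SimCLR-style NT-Xent agreement loss; function names, shapes, and the 0.75 masking ratio are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def mask_frames(frames, mask_ratio=0.75, rng=None):
    """Randomly hide a high fraction of frames along the temporal axis.

    frames: (T, D) array of per-frame features.
    Returns the visible frames and a boolean mask (True = masked out).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    num_frames = frames.shape[0]
    num_masked = int(num_frames * mask_ratio)
    order = rng.permutation(num_frames)
    mask = np.zeros(num_frames, dtype=bool)
    mask[order[:num_masked]] = True
    return frames[~mask], mask

def ntxent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss between two views (SimCLR-style).

    z1, z2: (N, D) embeddings; row i of z1 and row i of z2 are the
    two augmented views of the same video (the positive pair).
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity
    n = z1.shape[0]
    # Positive of sample i in the first view is sample i in the second view.
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    pos = sim[np.arange(2 * n), targets]
    return float(np.mean(logsumexp - pos))
```

With a 0.75 ratio, only a quarter of the frames are passed to the encoder; the loss is smaller when the two views of each video agree (high cosine similarity on positive pairs) than when they are unrelated.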
Pages: 2733-2741
Page count: 9