Contrastive Masked Autoencoders for Self-Supervised Video Hashing

Cited: 0
Authors
Wang, Yuting [1 ,3 ]
Wang, Jinpeng [1 ,3 ]
Chen, Bin [2 ]
Zeng, Ziyun [1 ,3 ]
Xia, Shu-Tao [1 ,3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Beijing, Peoples R China
[2] Harbin Inst Technol, Shenzhen, Peoples R China
[3] Peng Cheng Lab, Res Ctr Artificial Intelligence, Shenzhen, Peoples R China
Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3 | 2023
Funding
National Natural Science Foundation of China;
Keywords
DOI
(none listed)
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision, improving the efficiency of large-scale video retrieval and attracting increasing research attention. The success of SSVH lies in understanding video content and capturing the semantic relations among unlabeled videos. Typically, state-of-the-art SSVH methods address these two points with a two-stage training pipeline: they first train an auxiliary network on instance-wise mask-and-predict tasks and then train a hashing model to preserve the pseudo-neighborhood structure transferred from the auxiliary network. This consecutive training strategy is inflexible and also unnecessary. In this paper, we propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding in a single stage. To capture video semantic information, we adopt an encoder-decoder structure to reconstruct the video from its temporally masked frames. In particular, we find that a higher masking ratio helps video understanding. In addition, we fully exploit the similarity relationships between videos by maximizing agreement between two augmented views of a video, which yields more discriminative and robust hash codes. Extensive experiments on three large-scale video datasets (i.e., FCVID, ActivityNet, and YFCC) show that ConMH achieves state-of-the-art results. Code is available at https://github.com/huangmozhi9527/ConMH.
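The abstract describes two ingredients trained jointly: reconstructing a video from a small fraction of visible frames (high temporal masking ratio) and maximizing agreement between two augmented views of the same video. As a rough illustration of those two mechanics, here is a minimal NumPy sketch of random temporal masking and a SimCLR-style NT-Xent agreement loss; function names, shapes, and the 0.75 masking ratio are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def mask_frames(frames, mask_ratio=0.75, rng=None):
    """Randomly hide a high fraction of frames along the temporal axis.

    frames: (T, D) array of per-frame features.
    Returns the visible frames and a boolean mask (True = masked out).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    num_frames = frames.shape[0]
    num_masked = int(num_frames * mask_ratio)
    order = rng.permutation(num_frames)
    mask = np.zeros(num_frames, dtype=bool)
    mask[order[:num_masked]] = True
    return frames[~mask], mask

def ntxent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss between two views (SimCLR-style).

    z1, z2: (N, D) embeddings; row i of z1 and row i of z2 are the
    two augmented views of the same video (the positive pair).
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity
    n = z1.shape[0]
    # Positive of sample i in the first view is sample i in the second view.
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    pos = sim[np.arange(2 * n), targets]
    return float(np.mean(logsumexp - pos))
```

With a 0.75 ratio, only a quarter of the frames are passed to the encoder; the loss is smaller when the two views of each video agree (high cosine similarity on positive pairs) than when they are unrelated.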
Pages: 2733-2741
Page count: 9