Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Cited by: 0

Authors:

Alwassel, Humam [1]

Mahajan, Dhruv [2]

Korbar, Bruno [2]

Torresani, Lorenzo [2]

Ghanem, Bernard [1]

Tran, Du [2]

Affiliations:

[1] King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia

[2] Facebook AI, Menlo Pk, CA USA

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020 | 2020 / Vol. 33

Keywords:

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
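The central idea of the abstract, using cluster assignments from one modality as pseudo-labels for the other, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes synthetic feature vectors in place of real audio and video, a plain NumPy k-means in place of deep clustering, and a linear softmax classifier in place of the video encoder. All names (`kmeans`, `train_linear`, the feature dimensions, the learning rate) are hypothetical and chosen only for illustration.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def train_linear(x, y, k, epochs=200, lr=0.01):
    """Softmax regression trained with full-batch gradient descent."""
    w = np.zeros((x.shape[1], k))
    onehot = np.eye(k)[y]
    for _ in range(epochs):
        logits = x @ w
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        w -= lr * x.T @ (probs - onehot) / len(x)
    return w

# Synthetic "clips": a hidden semantic class drives both modalities,
# but each modality has its own feature space and its own noise.
rng = np.random.default_rng(0)
n, k = 300, 3
z = rng.integers(0, k, size=n)
audio = rng.normal(size=(k, 8))[z] * 3 + rng.normal(size=(n, 8))
video = rng.normal(size=(k, 16))[z] * 3 + rng.normal(size=(n, 16))

# Cross-modal step: cluster the audio features, then train the video
# model to predict those audio cluster ids (no human labels are used).
audio_pseudo = kmeans(audio, k)
w_video = train_linear(video, audio_pseudo, k)

video_pred = (video @ w_video).argmax(axis=1)
agreement = (video_pred == audio_pseudo).mean()
print("video predictions matching audio clusters:", agreement)
```

In this toy setting the video classifier can only match the audio clusters by picking up the semantics the two modalities share, which is the intuition the abstract describes. The roles of the two modalities can be swapped in the same way.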


Pages: 13
