Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Cited by: 0

Authors:

Alwassel, Humam [1]

Mahajan, Dhruv [2]

Korbar, Bruno [2]

Torresani, Lorenzo [2]

Ghanem, Bernard [1]

Tran, Du [2]

Affiliations:

[1] King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia

[2] Facebook AI, Menlo Pk, CA USA

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020 | 2020 / Vol. 33

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
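
The cross-modal supervision described in the abstract (cluster one modality, use the cluster assignments as labels for the other) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the features are synthetic Gaussians rather than learned video and audio encodings, a plain k-means and a linear softmax classifier stand in for the deep models, and the names `kmeans`, `train_softmax`, `audio`, and `video` are hypothetical.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def train_softmax(x, y, k, epochs=200, lr=0.1):
    """Linear softmax classifier trained by full-batch gradient descent."""
    n, d = x.shape
    w = np.zeros((d, k))
    b = np.zeros(k)
    onehot = np.eye(k)[y]
    losses = []
    for _ in range(epochs):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        losses.append(-np.log(probs[np.arange(n), y] + 1e-12).mean())
        grad = (probs - onehot) / n
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b, losses


# Synthetic stand-in data (assumption): both modalities depend on a shared
# latent class, mimicking the "correlated but different" premise.
rng = np.random.default_rng(0)
k, n = 3, 300
latent = rng.integers(0, k, size=n)
audio = (rng.normal(size=(k, 8)) * 5)[latent] + rng.normal(size=(n, 8))
video = (rng.normal(size=(k, 16)) * 5)[latent] + rng.normal(size=(n, 16))
video = (video - video.mean(axis=0)) / video.std(axis=0)  # standardize

# Step 1: cluster the audio features to obtain pseudo-labels.
pseudo = kmeans(audio, k)

# Step 2: train the video classifier on the audio-derived pseudo-labels.
w, b, losses = train_softmax(video, pseudo, k)

# Fraction of samples where the video prediction matches the audio cluster.
pred = (video @ w + b).argmax(axis=1)
agreement = float((pred == pseudo).mean())
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, agreement {agreement:.2f}")
```

The sketch runs a single round in one direction (audio clusters supervising video). The abstract also mentions the reverse direction and comparisons against single-modality and other multi-modal variants, which are not covered here.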


Pages: 13

50 items in total

[1] Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning
Das, Srijan
Ryoo, Michael
2023 18TH INTERNATIONAL CONFERENCE ON MACHINE VISION AND APPLICATIONS, MVA, 2023,
[2] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
Sarkar, Pritam
Etemad, Ali
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9723 - 9732
[3] Self-Supervised Correlation Learning for Cross-Modal Retrieval
Liu, Yaxin
Wu, Jianlong
Qu, Leigang
Gan, Tian
Yin, Jianhua
Nie, Liqiang
IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2851 - 2863
[4] SELF-SUPERVISED LEARNING WITH CROSS-MODAL TRANSFORMERS FOR EMOTION RECOGNITION
Khare, Aparna
Parthasarathy, Srinivas
Sundaram, Shiva
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 381 - 388
[5] Self-supervised incomplete cross-modal hashing retrieval
Peng, Shouyong
Yao, Tao
Li, Ying
Wang, Gang
Wang, Lili
Yan, Zhiming
EXPERT SYSTEMS WITH APPLICATIONS, 2025, 262
[6] Self-Supervised Visual Representations for Cross-Modal Retrieval
Patel, Yash
Gomez, Lluis
Rusinol, Marcal
Karatzas, Dimosthenis
Jawahar, C., V
ICMR'19: PROCEEDINGS OF THE 2019 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2019, : 182 - 186
[7] POSITIVE AND NEGATIVE SAMPLING STRATEGIES FOR SELF-SUPERVISED LEARNING ON AUDIO-VIDEO DATA
Wang, Shanshan
Tripathy, Soumya
Heittola, Toni
Mesaros, Annamaria
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 545 - 549
[8] CCMA: CapsNet for audio-video sentiment analysis using cross-modal attention
Li, Haibin
Guo, Aodi
Li, Yaqian
VISUAL COMPUTER, 2025, 41 (03): : 1609 - 1620
[9] Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning
Salvador, Amaia
Gundogdu, Erhan
Bazzani, Loris
Donoser, Michael
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 15470 - 15479
[10] Learning Mutual Modulation for Self-supervised Cross-Modal Super-Resolution
Dong, Xiaoyu
Yokoya, Naoto
Wang, Longguang
Uezato, Tatsumi
COMPUTER VISION, ECCV 2022, PT XIX, 2022, 13679 : 1 - 18
