Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Cited by: 0

Authors:

Alwassel, Humam [1]

Mahajan, Dhruv [2]

Korbar, Bruno [2]

Torresani, Lorenzo [2]

Ghanem, Bernard [1]

Tran, Du [2]

Affiliations:

[1] King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia

[2] Facebook AI, Menlo Pk, CA USA

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020 | 2020 / Vol. 33

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
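The cross-modal idea in the abstract, clustering one modality and using the cluster assignments as training targets for the other, can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic paired features, the plain k-means routine, the linear softmax classifier standing in for a video network, and all names (`make_toy_pairs`, `kmeans`, `train_linear_classifier`). It runs a single round of clustering followed by training and is not the authors' implementation.

```python
import numpy as np


def make_toy_pairs(n=300, k=3, dim=8, seed=0):
    """Synthetic paired 'audio' and 'video' features sharing a hidden concept.

    Hypothetical stand-in for real clips: each modality has its own
    prototypes, but both are driven by the same concept index.
    """
    rng = np.random.default_rng(seed)
    concept = rng.integers(0, k, size=n)
    audio_protos = rng.normal(size=(k, dim)) * 3.0
    video_protos = rng.normal(size=(k, dim)) * 3.0
    audio = audio_protos[concept] + rng.normal(size=(n, dim))
    video = video_protos[concept] + rng.normal(size=(n, dim))
    return audio, video


def kmeans(features, k, iters=20, seed=0):
    """Plain k-means; returns one cluster index per row of `features`."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every sample to every center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def train_linear_classifier(features, targets, k, lr=0.01, epochs=300):
    """Softmax regression trained by full-batch gradient descent."""
    n, d = features.shape
    weights = np.zeros((d, k))
    one_hot = np.eye(k)[targets]
    for _ in range(epochs):
        logits = features @ weights
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        weights -= lr * features.T @ (probs - one_hot) / n
    return weights


k = 3
audio_feats, video_feats = make_toy_pairs(k=k)

# Step 1: unsupervised clustering in the audio modality gives pseudo-labels.
audio_pseudo_labels = kmeans(audio_feats, k)

# Step 2: the video model is trained to predict the audio cluster of its clip.
video_weights = train_linear_classifier(video_feats, audio_pseudo_labels, k)

# Fraction of clips where the video prediction matches the audio cluster.
video_pred = (video_feats @ video_weights).argmax(axis=1)
agreement = (video_pred == audio_pseudo_labels).mean()
print(f"video predictions matching audio clusters: {agreement:.2f}")
```

In this sketch the video classifier never sees a human label. Its only supervision is the audio cluster index, so it can succeed only by picking up structure that the two modalities share.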


Pages: 13

