Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Cited by: 0
Authors
Alwassel, Humam [1]
Mahajan, Dhruv [2]
Korbar, Bruno [2]
Torresani, Lorenzo [2]
Ghanem, Bernard [1]
Tran, Du [2]
Affiliations
[1] King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia
[2] Facebook AI, Menlo Pk, CA USA
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
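The cross-modal supervision described in the abstract can be illustrated with a minimal sketch, assuming each clip already has one precomputed audio feature vector and one video feature vector. The function names, the toy features, and the farthest-point k-means initialization are illustrative assumptions and are not taken from the paper; the step that trains each encoder on its pseudo-labels is omitted.

```python
# Illustrative sketch only: not the authors' implementation of XDC.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Tiny k-means that returns one cluster index per point.

    Assumption: centroids start from a deterministic farthest-point
    initialization so that the toy example is reproducible.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


def cross_modal_pseudo_labels(audio_features, video_features, k):
    """Cluster each modality, then swap the assignments across modalities.

    The audio clusters become classification targets for the video model,
    and the video clusters become targets for the audio model.
    """
    audio_clusters = kmeans(audio_features, k)
    video_clusters = kmeans(video_features, k)
    return {"video_targets": audio_clusters, "audio_targets": video_clusters}


# Hypothetical features for four clips (clip i has audio[i] and video[i]).
audio = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
video = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

targets = cross_modal_pseudo_labels(audio, video, k=2)
print(targets["video_targets"])  # [0, 0, 1, 1]
print(targets["audio_targets"])  # [0, 0, 1, 1]
```

In a full pipeline, each encoder would presumably be trained with a classification loss on the targets from the other modality, and clustering would be repeated as the features improve; those details are not specified in this record.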
Pages: 13
Related papers
50 in total
  • [31] Self-Supervised Cross-Modal Online Learning of Basic Object Affordances for Developmental Robotic Systems
    Ridge, Barry
    Skocaj, Danijel
    Leonardis, Ales
    2010 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2010, : 5047 - 5054
  • [32] CroSSL: Cross-modal Self-Supervised Learning for Time-series through Latent Masking
    Deldari, Shohreh
    Spathis, Dimitris
    Malekzadeh, Mohammad
    Kawsar, Fahim
    Salim, Flora D.
    Mathur, Akhil
    PROCEEDINGS OF THE 17TH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, WSDM 2024, 2024, : 152 - 160
  • [33] Self-supervised learning-based weight adaptive hashing for fast cross-modal retrieval
    Li, Yifan
    Wang, Xuan
    Qi, Shuhan
    Huang, Chengkai
    Jiang, Zoe L.
    Liao, Qing
    Guan, Jian
    Zhang, Jiajia
    SIGNAL IMAGE AND VIDEO PROCESSING, 2021, 15 (04) : 673 - 680
  • [34] Cross-modal Embeddings for Video and Audio Retrieval
    Suris, Didac
    Duarte, Amanda
    Salvador, Amaia
    Torres, Jordi
    Giro-i-Nieto, Xavier
    COMPUTER VISION - ECCV 2018 WORKSHOPS, PT IV, 2019, 11132 : 711 - 716
  • [35] Self-Supervised Learning of Face Representations for Video Face Clustering
    Sharma, Vivek
    Tapaswi, Makarand
    Sarfraz, M. Saquib
    Stiefelhagen, Rainer
    2019 14TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG 2019), 2019, : 360 - 367
  • [36] SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding
    Sun, Chao
    Chen, Min
    Cheng, Jialiang
    Liang, Han
    Zhu, Chuanbo
    Chen, Jincai
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 261 - 270
  • [37] Self-supervised deep semantics-preserving Hashing for cross-modal retrieval
    Lu B.
    Duan X.
    Yuan Y.
    Qinghua Daxue Xuebao/Journal of Tsinghua University, 2022, 62 (09) : 1442 - 1449
  • [38] Audio self-supervised learning: A survey
    Liu, Shuo
    Mallol-Ragolta, Adria
    Parada-Cabaleiro, Emilia
    Qian, Kun
    Jing, Xin
    Kathan, Alexander
    Hu, Bin
    Schuller, Bjorn W.
    PATTERNS, 2022, 3 (12):
  • [39] Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
    Korbar, Bruno
    Tran, Du
    Torresani, Lorenzo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [40] CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding
    Afham, Mohamed
    Dissanayake, Isuru
    Dissanayake, Dinithi
    Dharmasiri, Amaya
    Thilakarathna, Kanchana
    Rodrigo, Ranga
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 9892 - 9902