Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Cited by: 0
Authors
Alwassel, Humam [1]
Mahajan, Dhruv [2]
Korbar, Bruno [2]
Torresani, Lorenzo [2]
Ghanem, Bernard [1]
Tran, Du [2]
Affiliations
[1] King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia
[2] Facebook AI, Menlo Pk, CA USA
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
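The cross-modal supervision idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmeans` and `cross_modal_pseudo_labels` are hypothetical, plain k-means stands in for whatever clustering procedure the paper uses, applying the swap in both directions is an assumption, and the encoders that would produce the features and be trained on the resulting targets are omitted.

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Plain k-means (an assumed stand-in); returns one cluster index per row."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct samples (fancy indexing copies them).
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every sample to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return labels

def cross_modal_pseudo_labels(audio_feats, video_feats, k):
    """Cluster each modality, then hand each modality the *other* one's clusters.

    Row i of both feature arrays is assumed to come from the same clip.
    """
    audio_clusters = kmeans(audio_feats, k)
    video_clusters = kmeans(video_feats, k)
    return {
        "video_targets": audio_clusters,  # video model would be trained on audio clusters
        "audio_targets": video_clusters,  # audio model would be trained on video clusters
    }
```

In a full pipeline these targets would feed a standard classification loss for each encoder, with clustering and training alternated; that loop is left out here to keep the sketch self-contained.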
Pages: 13
Related papers
50 in total
  • [1] Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning
    Das, Srijan
    Ryoo, Michael
    2023 18TH INTERNATIONAL CONFERENCE ON MACHINE VISION AND APPLICATIONS, MVA, 2023,
  • [2] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
    Sarkar, Pritam
    Etemad, Ali
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9723 - 9732
  • [3] Self-Supervised Correlation Learning for Cross-Modal Retrieval
    Liu, Yaxin
    Wu, Jianlong
    Qu, Leigang
    Gan, Tian
    Yin, Jianhua
    Nie, Liqiang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2851 - 2863
  • [4] SELF-SUPERVISED LEARNING WITH CROSS-MODAL TRANSFORMERS FOR EMOTION RECOGNITION
    Khare, Aparna
    Parthasarathy, Srinivas
    Sundaram, Shiva
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 381 - 388
  • [5] Self-supervised incomplete cross-modal hashing retrieval
    Peng, Shouyong
    Yao, Tao
    Li, Ying
    Wang, Gang
    Wang, Lili
    Yan, Zhiming
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 262
  • [6] Self-Supervised Visual Representations for Cross-Modal Retrieval
    Patel, Yash
    Gomez, Lluis
    Rusinol, Marcal
    Karatzas, Dimosthenis
    Jawahar, C., V
    ICMR'19: PROCEEDINGS OF THE 2019 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2019, : 182 - 186
  • [7] POSITIVE AND NEGATIVE SAMPLING STRATEGIES FOR SELF-SUPERVISED LEARNING ON AUDIO-VIDEO DATA
    Wang, Shanshan
    Tripathy, Soumya
    Heittola, Toni
    Mesaros, Annamaria
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 545 - 549
  • [8] CCMA: CapsNet for audio-video sentiment analysis using cross-modal attention
    Li, Haibin
    Guo, Aodi
    Li, Yaqian
    VISUAL COMPUTER, 2025, 41 (03): : 1609 - 1620
  • [9] Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning
    Salvador, Amaia
    Gundogdu, Erhan
    Bazzani, Loris
    Donoser, Michael
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 15470 - 15479
  • [10] Learning Mutual Modulation for Self-supervised Cross-Modal Super-Resolution
    Dong, Xiaoyu
    Yokoya, Naoto
    Wang, Longguang
    Uezato, Tatsumi
    COMPUTER VISION, ECCV 2022, PT XIX, 2022, 13679 : 1 - 18