Learning Self-supervised Audio-Visual Representations for Sound Recommendations

Cited by: 1
Author
Krishnamurthy, Sudha [1 ]
Affiliation
[1] Sony Interactive Entertainment, San Mateo, CA 94404 USA
Keywords
Self-supervision; Representation learning; Cross-modal correlation;
DOI
10.1007/978-3-030-90436-4_10
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional features extracted at different resolutions from the audio and visual streams and uses the attention features to encode the audio and visual input based on their correspondence. We evaluated the representations learned by the model to classify audio-visual correlation as well as to recommend sound effects for visual scenes. Our results show that the representations generated by the attention model improve the correlation accuracy by 18% and the recommendation accuracy by 10% compared to the baseline on VGG-Sound, a public video dataset. Additionally, audio-visual representations learned by training the attention model with cross-modal contrastive learning further improve the recommendation performance, based on our evaluation using VGG-Sound and a more challenging dataset consisting of gameplay video recordings.
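The abstract mentions training the attention model with cross-modal contrastive learning between paired audio and visual embeddings. As an illustration only (this is not the paper's implementation; the symmetric InfoNCE formulation, function name, and temperature value below are assumptions), such an objective can be sketched as:

```python
import numpy as np

def cross_modal_contrastive_loss(audio_emb, visual_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Audio/visual embeddings from the same video (same batch row) are
    treated as positives; every other pairing in the batch is a negative.
    """
    # L2-normalize so the dot product becomes cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature       # (batch, batch) similarity matrix
    idx = np.arange(len(a))              # positives lie on the diagonal

    def nce(lg):
        # cross-entropy of each row against its diagonal (positive) entry
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()

    # symmetric: audio-to-visual and visual-to-audio retrieval directions
    return 0.5 * (nce(logits) + nce(logits.T))
```

Minimizing this loss pulls each video's audio and visual embeddings together while pushing apart embeddings from different videos, which matches the correspondence-based training signal described in the abstract.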
Pages: 124-138
Page count: 15
Related Papers
50 items in total
  • [21] Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling
    Masuyama, Yoshiki
    Bando, Yoshiaki
    Yatabe, Kohei
    Sasaki, Yoko
    Onishi, Masaki
    Oikawa, Yasuhiro
    2020 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2020, : 4848 - 4854
  • [22] Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection
    Kim, Ui-Hyun
    INTERSPEECH 2021, 2021, : 326 - 330
  • [23] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
    Sarkar, Pritam
    Etemad, Ali
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9723 - 9732
  • [24] Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization
    Liu, Tianyu
    Zhang, Peng
    Huang, Wei
    Zha, Yufei
    You, Tao
    Zhang, Yanning
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4042 - 4052
  • [25] Self-Supervised Moving Vehicle Detection From Audio-Visual Cues
    Zuern, Jannik
    Burgard, Wolfram
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (03) : 7415 - 7422
  • [26] Learning Action Representations for Self-supervised Visual Exploration
    Oh, Changjae
    Cavallaro, Andrea
    2019 INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2019, : 5873 - 5879
  • [27] Learning Representations for New Sound Classes With Continual Self-Supervised Learning
    Wang, Zhepei
    Subakan, Cem
    Jiang, Xilin
    Wu, Junkai
    Tzinis, Efthymios
    Ravanelli, Mirco
    Smaragdis, Paris
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2607 - 2611
  • [28] Audio-Visual Self-Supervised Terrain Type Recognition for Ground Mobile Platforms
    Kurobe, Akiyoshi
    Nakajima, Yoshikatsu
    Kitani, Kris
    Saito, Hideo
    IEEE ACCESS, 2021, 9 : 29970 - 29979
  • [29] Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
    Li, Yidi
    Liu, Hong
    Tang, Hao
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1456 - 1463
  • [30] Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
    Cheng, Ying
    Wang, Ruize
    Pan, Zhihao
    Feng, Rui
    Zhang, Yuejie
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3884 - 3892