Learning Self-supervised Audio-Visual Representations for Sound Recommendations

Cited by: 1

Authors:

Krishnamurthy, Sudha [1]

Affiliations:

[1] Sony Interactive Entertainment, San Mateo, CA 94404 USA

Source:

ADVANCES IN VISUAL COMPUTING (ISVC 2021), PT II | 2021 / Vol. 13018

Keywords:

Self-supervision; Representation learning; Cross-modal correlation

DOI:

10.1007/978-3-030-90436-4_10

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional features extracted at different resolutions from the audio and visual streams, and uses the attention features to encode the audio and visual input based on their correspondence. We evaluated the representations learned by the model by classifying audio-visual correlation and by recommending sound effects for visual scenes. Our results show that, compared to the baseline, the representations generated by the attention model improve the correlation accuracy by 18% and the recommendation accuracy by 10% on VGG-Sound, a public video dataset. Additionally, audio-visual representations learned by training the attention model with cross-modal contrastive learning further improve the recommendation performance, based on our evaluation using VGG-Sound and a more challenging dataset consisting of gameplay video recordings.
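The abstract names cross-modal contrastive learning but does not give the loss. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style objective over a batch of paired embeddings, where the i-th audio embedding should match the i-th visual embedding. The function names, the `temperature` parameter, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_modal_contrastive_loss(audio_embs, visual_embs, temperature=0.1):
    """Assumed InfoNCE-style loss; pair i (audio i, visual i) is the positive."""
    assert audio_embs and len(audio_embs) == len(visual_embs)
    a = [l2_normalize(e) for e in audio_embs]
    v = [l2_normalize(e) for e in visual_embs]
    n = len(a)
    # sim[i][j]: temperature-scaled cosine similarity of audio i and visual j.
    sim = [[dot(a[i], v[j]) / temperature for j in range(n)] for i in range(n)]

    def softmax_cross_entropy(logits, target):
        # Numerically stable -log(softmax(logits)[target]).
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        return log_denominator - logits[target]

    # Audio-to-visual uses rows of sim; visual-to-audio uses columns.
    audio_to_visual = sum(softmax_cross_entropy(sim[i], i) for i in range(n)) / n
    visual_to_audio = sum(
        softmax_cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (audio_to_visual + visual_to_audio)


# Toy embeddings (hypothetical): matched pairs versus deliberately swapped pairs.
audio = [[1.0, 0.0], [0.0, 1.0]]
matched_loss = cross_modal_contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = cross_modal_contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
print(matched_loss < swapped_loss)
```

Under this assumed objective, correctly paired audio and visual embeddings give a much smaller loss than mismatched ones, which is the property a retrieval-style sound recommendation would rely on.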

Pages: 124-138

Page count: 15

50 records in total

[21] Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling
Masuyama, Yoshiki
Bando, Yoshiaki
Yatabe, Kohei
Sasaki, Yoko
Onishi, Masaki
Oikawa, Yasuhiro
2020 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2020, : 4848 - 4854
[22] Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection
Kim, Ui-Hyun
INTERSPEECH 2021, 2021, : 326 - 330
[23] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
Sarkar, Pritam
Etemad, Ali
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9723 - 9732
[24] Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization
Liu, Tianyu
Zhang, Peng
Huang, Wei
Zha, Yufei
You, Tao
Zhang, Yanning
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4042 - 4052
[25] Self-Supervised Moving Vehicle Detection From Audio-Visual Cues
Zuern, Jannik
Burgard, Wolfram
IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (03) : 7415 - 7422
[26] Learning Action Representations for Self-supervised Visual Exploration
Oh, Changjae
Cavallaro, Andrea
2019 INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2019, : 5873 - 5879
[27] Learning Representations for New Sound Classes With Continual Self-Supervised Learning
Wang, Zhepei
Subakan, Cem
Jiang, Xilin
Wu, Junkai
Tzinis, Efthymios
Ravanelli, Mirco
Smaragdis, Paris
IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2607 - 2611
[28] Audio-Visual Self-Supervised Terrain Type Recognition for Ground Mobile Platforms
Kurobe, Akiyoshi
Nakajima, Yoshikatsu
Kitani, Kris
Saito, Hideo
IEEE ACCESS, 2021, 9 : 29970 - 29979
[29] Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
Li, Yidi
Liu, Hong
Tang, Hao
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1456 - 1463
[30] Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
Cheng, Ying
Wang, Ruize
Pan, Zhihao
Feng, Rui
Zhang, Yuejie
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3884 - 3892
