Audio-visual localization based on spatial relative sound order

Citations: 0
Authors
Sato, Tomoya [1]
Sugano, Yusuke [1]
Sato, Yoichi [1]
Affiliations
[1] Univ Tokyo, Inst Ind Sci, Tokyo, Japan
Keywords
Computer vision; Machine learning; Audio-visual learning; Sound localization
DOI
10.1007/s00138-025-01700-0
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Sound localization is an essential task in audio-visual learning. In particular, stereo sound localization methods have been proposed to handle multiple sound sources. However, existing stereo sound localization methods treat sound source localization as a segmentation task and therefore require costly annotation of segmentation masks. Another serious problem is that these methods have been trained and evaluated only in controlled environments, such as fixed camera and microphone settings with limited scene variability. Their performance on videos recorded in uncontrolled environments, such as in-the-wild videos from the Internet, has therefore not been fully investigated. To address these problems, we propose a weakly supervised method that extends a typical stereo sound localization method by utilizing the spatial relative order of sound sources in recorded videos. The proposed method solves the annotation problem by training the localization model using only sound category labels. Furthermore, because the spatial relative order of sound sources is not affected by specific recording settings, the method can be used effectively for videos recorded in uncontrolled environments. We also collect stereo-recorded videos from YouTube to construct a new dataset that demonstrates the applicability of the proposed method to stereo sounds recorded in various environments. By inserting a novel training step that exploits the relative order of sound sources into a typical audio-visual localization method, our method improves localization performance on both existing and newly introduced audio-visual datasets.
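
The abstract does not spell out how the relative order of sound sources enters training, so the following is only an illustrative sketch of one way such a constraint could be written, not the authors' formulation. It assumes that each sound category has a predicted localization heatmap and a scalar left-right audio position, here estimated from the energy difference between the two stereo channels. The names `stereo_pan`, `horizontal_centroid`, `relative_order_loss`, and the `margin` parameter are all hypothetical.

```python
import numpy as np


def stereo_pan(left, right, eps=1e-8):
    """Rough left-right position in [-1, 1] from channel energies.

    Assumption for illustration: a source louder in the right channel
    is further to the right. Negative means left, positive means right.
    """
    e_left = float(np.sum(np.asarray(left, dtype=float) ** 2))
    e_right = float(np.sum(np.asarray(right, dtype=float) ** 2))
    return (e_right - e_left) / (e_right + e_left + eps)


def horizontal_centroid(heatmap):
    """Expected x-coordinate in [0, 1] of a non-negative localization map."""
    h = np.asarray(heatmap, dtype=float)
    cols = h.sum(axis=0)            # total mass per image column
    total = cols.sum()
    if total <= 0:
        return 0.5                  # no evidence: fall back to the image center
    xs = (np.arange(h.shape[1]) + 0.5) / h.shape[1]
    return float((cols * xs).sum() / total)


def relative_order_loss(heatmaps, audio_positions, margin=0.1):
    """Pairwise hinge loss on left-right order (hypothetical formulation).

    A pair of sources is penalized when the order of their heatmap
    centroids disagrees with the order of their audio positions, or
    agrees by less than `margin`.
    """
    cx = [horizontal_centroid(h) for h in heatmaps]
    loss, pairs = 0.0, 0
    for i in range(len(cx)):
        for j in range(i + 1, len(cx)):
            sign = np.sign(audio_positions[i] - audio_positions[j])
            if sign == 0:
                continue            # no order information for this pair
            loss += max(0.0, margin - sign * (cx[i] - cx[j]))
            pairs += 1
    return loss / pairs if pairs else 0.0


# Two toy 4x4 heatmaps: one source on the far left, one on the far right.
left_map = np.zeros((4, 4))
left_map[:, 0] = 1.0
right_map = np.zeros((4, 4))
right_map[:, 3] = 1.0

consistent = relative_order_loss([left_map, right_map], [-1.0, 1.0])
swapped = relative_order_loss([left_map, right_map], [1.0, -1.0])
```

In this toy case `consistent` is zero because the visual order matches the audio order, while `swapped` is positive. Only the sign of the difference between audio positions is used, which reflects the abstract's point that relative order does not depend on a specific camera and microphone setup.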
Pages: 14