Audio-visual localization based on spatial relative sound order

Cited by: 0
Authors
Tomoya Sato [1]
Yusuke Sugano [1]
Yoichi Sato [1]
Affiliations
[1] Institute of Industrial Science, The University of Tokyo
Keywords
Computer vision; Machine learning; Audio-visual learning; Sound localization
DOI
10.1007/s00138-025-01700-0
Abstract
Sound localization is an essential task in audio-visual learning. In particular, stereo sound localization methods have been proposed to handle multiple sound sources. However, existing stereo sound localization methods treat sound source localization as a segmentation task and therefore require costly annotation of segmentation masks. Another serious problem is that these methods have been trained and evaluated only in controlled environments, such as a fixed camera and microphone setup with limited scene variability. Consequently, their performance on videos recorded in uncontrolled environments, such as in-the-wild videos from the Internet, has not been fully investigated. To address these problems, we propose a weakly supervised method that extends a typical stereo sound localization method by utilizing the spatial relative order of sound sources in recorded videos. The proposed method solves the annotation problem by training the localization model using only sound category labels. Furthermore, because the spatial relative order of sound sources is not affected by specific recording settings, the method can be used effectively for videos recorded in uncontrolled environments. We also collect stereo-recorded videos from YouTube to construct a new dataset that demonstrates the applicability of the proposed method to stereo sound recorded in various environments. Inserting a novel training step that exploits the relative order of sound sources into a typical audio-visual localization method improves localization performance on both existing and newly introduced audio-visual datasets.
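The abstract does not say how the relative order is estimated or which loss the added training step uses, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that each sound source has a predicted localization heatmap and a per-source left/right channel energy, and it penalizes pairs whose predicted horizontal order disagrees with the order implied by the stereo balance. All function names, the hinge formulation, and the `margin` parameter are hypothetical.

```python
import numpy as np

def horizontal_centroid(heatmap):
    """Expected x position (0 = left edge, 1 = right edge) of a non-negative heatmap."""
    heatmap = np.asarray(heatmap, dtype=float)
    column_mass = heatmap.sum(axis=0)
    total = column_mass.sum()
    if total <= 0:
        return 0.5  # no evidence: fall back to the image centre
    xs = np.linspace(0.0, 1.0, heatmap.shape[1])
    return float((column_mass * xs).sum() / total)

def stereo_pan(left_energy, right_energy):
    """Pan in [-1, 1]; negative means the source is louder in the left channel."""
    denom = left_energy + right_energy
    return 0.0 if denom <= 0 else (right_energy - left_energy) / denom

def relative_order_loss(heatmaps, pans, margin=0.1):
    """Hinge ranking loss over source pairs (hypothetical formulation).

    Only the ordering of the pans is used, never their magnitudes, so the
    loss does not depend on a calibrated camera or microphone geometry.
    """
    xs = [horizontal_centroid(h) for h in heatmaps]
    loss, pairs = 0.0, 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            if pans[i] == pans[j]:
                continue  # no order information for this pair
            # sign = +1 when source i should appear to the left of source j
            sign = 1.0 if pans[i] < pans[j] else -1.0
            loss += max(0.0, margin - sign * (xs[j] - xs[i]))
            pairs += 1
    return loss / pairs if pairs else 0.0
```

In this sketch, two heatmaps whose centroids already match the stereo order by at least `margin` contribute zero loss, while a pair in the wrong order is penalized in proportion to how far apart the centroids are. A trainable version would express the same computation in an automatic-differentiation framework.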