Audio-visual localization based on spatial relative sound order

Cited by: 0
Authors
Tomoya Sato [1 ]
Yusuke Sugano [1 ]
Yoichi Sato [1 ]
Affiliations
[1] Institute of Industrial Science, The University of Tokyo
Keywords
Computer vision; Machine learning; Audio-visual learning; Sound localization
DOI
10.1007/s00138-025-01700-0
Abstract
Sound localization is one of the essential tasks in audio-visual learning. In particular, stereo-sound localization methods have been proposed to handle multiple sound sources. However, existing stereo-sound localization methods treat sound source localization as a segmentation task and therefore require costly annotation of segmentation masks. Another serious limitation is that these methods have been trained and evaluated only in controlled environments, such as fixed camera and microphone settings with limited scene variability. As a result, their performance on videos recorded in uncontrolled environments, such as in-the-wild videos from the Internet, has not been fully investigated. To address these problems, we propose a weakly supervised method that extends a typical stereo-sound localization method by exploiting the spatial relative order of sound sources in recorded videos. The proposed method solves the annotation problem by training the localization model using only sound category labels. Furthermore, because the spatial relative order of sound sources is not affected by specific recording settings, our method can be effectively applied to videos recorded in uncontrolled environments. We also collect stereo-recorded videos from YouTube to construct a new dataset that demonstrates the applicability of the proposed method to stereo sounds recorded in diverse environments. By inserting a novel training step that exploits the relative order of sound sources into a typical audio-visual localization method, our approach improves localization performance on both existing and newly introduced audio-visual datasets.
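The abstract describes supervising a localizer with the spatial relative order of sound sources instead of segmentation masks. As a rough illustration only, the sketch below shows one plausible way such an ordering constraint could be imposed: a pairwise ranking loss that penalizes the model when the predicted horizontal positions of two sound categories contradict the left/right order inferred from the stereo channels. The function name, the hinge form of the loss, and the use of channel level difference to obtain the order are all assumptions for illustration, not the paper's actual formulation.

import torch
import torch.nn.functional as F

def relative_order_loss(x_a, x_b, order, margin=0.1):
    """Hypothetical pairwise ranking loss on predicted source positions.

    x_a, x_b : (B,) predicted horizontal coordinates (0 = left edge,
               1 = right edge) of the localization peaks for two sound
               categories A and B in each video frame.
    order    : (B,) +1 if A should lie to the left of B, -1 otherwise;
               in practice this could be inferred from the interaural
               (channel) level difference of the stereo recording.
    margin   : minimum separation enforced between the two positions.
    """
    # When order = +1 we want x_b - x_a > margin (A left of B);
    # the hinge activates only when the predicted order is violated.
    return F.relu(margin - order * (x_b - x_a)).mean()

# Usage with dummy predictions: both pairs satisfy their ordering,
# so the loss is zero.
x_a = torch.tensor([0.2, 0.7])
x_b = torch.tensor([0.8, 0.3])
order = torch.tensor([1.0, -1.0])   # A left of B, then A right of B
print(relative_order_loss(x_a, x_b, order))  # tensor(0.)

Because a ranking constraint of this kind only compares positions, it needs no pixel-level masks and is invariant to the absolute geometry of the camera-microphone rig, which is consistent with the abstract's claim that relative order transfers to uncontrolled recording settings.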