Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Times Cited: 0
Authors
Yun, Heeseung [1 ,2 ]
Gao, Ruohan [2 ]
Ananthabhotla, Ishwarya [2 ]
Kumar, Anurag [2 ]
Donley, Jacob [2 ]
Li, Chao [2 ]
Kim, Gunhee [1 ]
Ithapu, Vamsi Krishna [2 ]
Murdock, Calvin [2 ]
Affiliations
[1] Seoul National University, Seoul, South Korea
[2] Meta, Reality Labs Research, Redmond, WA, USA
Source
COMPUTER VISION - ECCV 2024, PT XXIV | 2025 / Vol. 15082
Keywords
Egocentric Vision; Audio-Visual Learning; Head Movements; Sound
DOI
10.1007/978-3-031-72691-0_15
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive projections between image and world coordinate systems. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including audio-visual active speaker localization, auditory spherical source localization, and behavior anticipation in everyday activities.
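To make the world-locking operation concrete, the minimal NumPy sketch below illustrates the underlying idea under a simple assumption: head-locked direction vectors are compensated by the measured head rotation so that a static source keeps a fixed direction on a world-locked sphere. This is not the authors' implementation, and the names world_lock and R_world_to_head are hypothetical.

```python
# A minimal sketch of spherical world-locking (not the paper's code).
import numpy as np

def world_lock(directions: np.ndarray, R_world_to_head: np.ndarray) -> np.ndarray:
    """Map head-locked unit direction vectors back to world coordinates.

    directions:      (N, 3) unit direction vectors in the head-locked frame.
    R_world_to_head: (3, 3) rotation matrix from world to head coordinates,
                     e.g. obtained from head-orientation (IMU) measurements.
    """
    # Applying the inverse (transpose) rotation undoes self-motion, so a static
    # source keeps a fixed direction on the world-locked sphere across frames.
    return directions @ R_world_to_head  # row-vector form of R^T @ d

# Example: a static source straight ahead in the world frame, observed after a
# 30-degree head yaw; world-locking recovers the original world direction.
yaw = np.deg2rad(30.0)
R_world_to_head = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                            [np.sin(yaw),  np.cos(yaw), 0.0],
                            [0.0,          0.0,         1.0]])
d_world = np.array([1.0, 0.0, 0.0])              # world-frame source direction
d_head = (R_world_to_head @ d_world)[None, :]    # same source, head-locked view
print(world_lock(d_head, R_world_to_head))       # ~ [[1, 0, 0]] again
```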
Pages: 256 - 274
Page Count: 19
Related Papers
13 records in total
  • [1] Listen to Look Into the Future: Audio-Visual Egocentric Gaze Anticipation
    Lai, Bolin
    Ryan, Fiona
    Jia, Wenqi
    Liu, Miao
    Rehg, James M.
    COMPUTER VISION - ECCV 2024, PT IX, 2025, 15067 : 192 - 210
  • [2] Semantic and Relation Modulation for Audio-Visual Event Localization
    Wang, Hao
    Zha, Zheng-Jun
    Li, Liang
    Chen, Xuejin
    Luo, Jiebo
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 7711 - 7725
  • [3] Audio-visual object removal in 360-degree videos
    Shimamura, Ryo
    Feng, Qi
    Koyama, Yuki
    Nakatsuka, Takayuki
    Fukayama, Satoru
    Hamasaki, Masahiro
    Goto, Masataka
    Morishima, Shigeo
    VISUAL COMPUTER, 2020, 36 (10-12) : 2117 - 2128
  • [4] Audio-Visual Localization by Synthetic Acoustic Image Generation
    Sanguineti, Valentina
    Morerio, Pietro
    Del Bue, Alessio
    Murino, Vittorio
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2523 - 2531
  • [5] Dense Modality Interaction Network for Audio-Visual Event Localization
    Liu, Shuo
    Quan, Weize
    Wang, Chaoqun
    Liu, Yuan
    Liu, Bin
    Yan, Dong-Ming
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2734 - 2748
  • [6] Audio-visual localization based on spatial relative sound order
    Sato, Tomoya
    Sugano, Yusuke
    Sato, Yoichi
    MACHINE VISION AND APPLICATIONS, 2025, 36 (4)
  • [7] Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds
    Sato, Tomoya
    Sugano, Yusuke
    Sato, Yoichi
    IEEE ACCESS, 2022, 10 : 94273 - 94284
  • [8] A Closer Look at Weakly-Supervised Audio-Visual Source Localization
    Mo, Shentong
    Morgado, Pedro
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [9] Leveraging the Video-Level Semantic Consistency of Event for Audio-Visual Event Localization
    Jiang, Yuanyuan
    Yin, Jianqin
    Dang, Yonghao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 4617 - 4627
  • [10] Self-Supervised Video Representation and Temporally Adaptive Attention for Audio-Visual Event Localization
    Ran, Yue
    Tang, Hongying
    Li, Baoqing
    Wang, Guohui
    APPLIED SCIENCES-BASEL, 2022, 12 (24):