Egocentric Audio-Visual Object Localization

Cited by: 11
Authors
Huang, Chao [1 ]
Flan, Yapeng [1 ]
Kurnar, Anurag [2 ]
Xu, Chenliang [1 ]
Affiliations
[1] Univ Rochester, Rochester, NY 14627 USA
[2] Meta Reality Labs Research, Redmond, WA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.02194
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Humans naturally perceive surrounding scenes by unifying sound and sight from a first-person view. Likewise, machines can approach human intelligence by learning from multisensory inputs captured from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; and 2) out-of-view sound components can arise when wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module that handles egomotion explicitly: its effect is mitigated by estimating the temporal geometric transformation between frames and exploiting it to update visual representations. To overcome the second issue, we propose a cascaded feature enhancement module that improves cross-modal localization robustness by disentangling visually indicated audio representations. During training, we take advantage of naturally occurring audio-visual temporal synchronization as "free" self-supervision to avoid costly labeling. We also annotate and release the Epic Sounding Object dataset for evaluation. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and generalizes to diverse audio-visual scenes. Code is available at https://github.com/WikiChao/Ego-AV-Loc.
Pages: 22910-22921
Page count: 12
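Illustrative code sketches

The abstract describes two technical ingredients that a short sketch can make concrete. First, the geometry-aware temporal aggregation: past visual feature maps are warped into the current frame's coordinates to compensate for egomotion before being aggregated. The PyTorch sketch below assumes egomotion is approximated by per-frame 3x3 homographies and uses mean pooling for aggregation; the function names warp_features and aggregate, and all of these modeling choices, are illustrative placeholders, not the authors' implementation (see the linked repository for that).

# Illustrative sketch only -- not the authors' released code.
# Assumption: egomotion between each past frame and the current frame
# is approximated by a 3x3 homography acting on the normalized [-1, 1]
# coordinate space expected by F.grid_sample.
import torch
import torch.nn.functional as F

def warp_features(feat, H):
    """Warp a (B, C, h, w) feature map with a batch of 3x3 homographies.

    H maps current-frame (output) coordinates to source-frame
    coordinates, which is the direction grid_sample needs for sampling.
    """
    B, C, h, w = feat.shape
    # Build a normalized pixel grid over the output frame.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=feat.device),
        torch.linspace(-1, 1, w, device=feat.device),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    grid = torch.stack([xs, ys, ones], dim=-1).reshape(1, -1, 3)  # (1, h*w, 3)
    src = grid @ H.transpose(1, 2)                                # (B, h*w, 3)
    # Dehomogenize; assumes the homogeneous coordinate stays positive.
    src = src[..., :2] / src[..., 2:].clamp(min=1e-8)
    src = src.reshape(B, h, w, 2)
    return F.grid_sample(feat, src, align_corners=True)

def aggregate(feats, homs):
    """Mean-pool T past feature maps after warping each into the current
    frame. feats: list of (B, C, h, w); homs: list of (B, 3, 3)."""
    warped = [warp_features(f, H) for f, H in zip(feats, homs)]
    return torch.stack(warped, dim=0).mean(dim=0)

# Example usage with random features and identity homographies:
# feats = [torch.randn(2, 256, 14, 14) for _ in range(3)]
# homs = [torch.eye(3).expand(2, 3, 3) for _ in range(3)]
# fused = aggregate(feats, homs)  # (2, 256, 14, 14)

Second, the "free" self-supervision from audio-visual temporal synchronization is commonly realized as a contrastive objective that pulls temporally aligned audio-visual pairs together and pushes misaligned ones apart. The symmetric InfoNCE-style loss below is a generic version of that idea, again an assumption rather than the paper's exact objective.

def sync_contrastive_loss(v, a, temperature=0.07):
    """Symmetric InfoNCE over pooled embeddings v, a of shape (B, D):
    aligned (v_i, a_i) pairs are positives, all other in-batch pairs
    are negatives."""
    v = F.normalize(v, dim=-1)
    a = F.normalize(a, dim=-1)
    logits = v @ a.t() / temperature  # (B, B) cosine-similarity logits
    targets = torch.arange(v.size(0), device=v.device)
    # Cross-entropy in both retrieval directions: video->audio, audio->video.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

Normalized [-1, 1] coordinates are used in the warping sketch so the same homography applies at any feature resolution; the contrastive loss uses in-batch negatives, the standard low-cost choice when no labels are available.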