EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

Citations: 231
Authors
Kazakos, Evangelos [1]
Nagrani, Arsha [2]
Zisserman, Andrew [2]
Damen, Dima [1]
Affiliations
[1] Univ Bristol, Visual Informat Lab, Bristol, Avon, England
[2] Univ Oxford, Visual Geometry Grp, Oxford, England
Source
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
DOI
10.1109/ICCV.2019.00559
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities (RGB, Flow and Audio) and combine them with mid-level fusion alongside sparse temporal sampling of fused representations. In contrast with previous work, modalities are fused before temporal aggregation, with shared modality and fusion weights over time. Our proposed architecture is trained end-to-end, outperforming individual modalities as well as late fusion of modalities. We demonstrate the importance of audio in egocentric vision, on a per-class basis, for identifying actions as well as interacting objects. Our method achieves state-of-the-art results on both the seen and unseen test sets of the largest egocentric dataset, EPIC-Kitchens, on all metrics using the public leaderboard.
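The fuse-before-aggregate idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the single ReLU linear layer standing in for the fusion network, the uniform sampling scheme, and the averaging step are all assumptions made for illustration. It shows one temporal binding window per segment, one sample per modality inside that window, fusion with weights shared across windows, and aggregation only after fusion.

```python
import random

MODALITIES = ("rgb", "flow", "audio")


def sample_indices(num_steps, num_segments, window, rng):
    """Sparse sampling: one binding window per segment, one index per modality in it."""
    seg_len = num_steps / num_segments
    samples = []
    for k in range(num_segments):
        lo = int(k * seg_len)
        hi = max(lo, int((k + 1) * seg_len) - 1)
        centre = rng.randint(lo, hi)  # window centre, drawn inside segment k
        w_lo = max(0, centre - window)
        w_hi = min(num_steps - 1, centre + window)
        # Each modality may come from a different offset within the same window.
        samples.append({m: rng.randint(w_lo, w_hi) for m in MODALITIES})
    return samples


def fuse(weights, bias, x):
    """Hypothetical fusion layer: one linear map followed by ReLU."""
    out = []
    for row, b in zip(weights, bias):
        out.append(max(0.0, sum(w * v for w, v in zip(row, x)) + b))
    return out


def fuse_then_aggregate(features, samples, weights, bias):
    """features[m][t] is a feature vector for modality m at time step t.

    Modalities are concatenated and fused per window with the same weights,
    and only the fused vectors are averaged over time.
    """
    fused = []
    for idx in samples:
        concat = [v for m in MODALITIES for v in features[m][idx[m]]]
        fused.append(fuse(weights, bias, concat))
    return [sum(col) / len(fused) for col in zip(*fused)]


# Toy usage: 12 time steps, 2-dimensional features per modality, 3 fused units.
rng = random.Random(0)
features = {m: [[1.0, 1.0] for _ in range(12)] for m in MODALITIES}
weights = [[1.0] * 6 for _ in range(3)]
bias = [0.0] * 3
samples = sample_indices(num_steps=12, num_segments=3, window=2, rng=rng)
video_repr = fuse_then_aggregate(features, samples, weights, bias)
```

A late-fusion baseline would instead aggregate each modality over time on its own and combine the results only at the end. Here the combination happens inside every window, so the fusion layer sees modalities that are close in time but not necessarily synchronous.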
Pages: 5491-5500
Page count: 10