Dual Attention Matching for Audio-Visual Event Localization

Cited by: 147
Authors
Wu, Yu [1, 2]
Zhu, Linchao [2]
Yan, Yan [3]
Yang, Yi [2]
Affiliations
[1] Baidu Res, Beijing, Peoples R China
[2] Univ Technol Sydney, ReLER, Sydney, NSW, Australia
[3] Texas State Univ, San Marcos, TX USA
Source
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019
DOI
10.1109/ICCV.2019.00639
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we investigate the audio-visual event localization problem, in which the task is to localize an event that is both visible and audible in a video. Previous methods first divide a video into short segments and then fuse visual and acoustic features at the segment level. Because these segments are usually short, the visual and acoustic features of a segment may not be well aligned, so directly concatenating the two features at the segment level is vulnerable to even a minor temporal misalignment between the two signals. We propose a Dual Attention Matching (DAM) module that covers a longer video duration to better model high-level event information, while local temporal information is obtained through a global cross-check mechanism. Our premise is that one should watch the whole video to understand the high-level event, while shorter segments should be checked in detail for localization. Specifically, the global feature of one modality queries the local features of the other modality, in both directions. With the temporal co-occurrence between auditory and visual signals encoded, DAM can be readily applied to various audio-visual event localization tasks, e.g., cross-modality localization and supervised event localization. Experiments on the AVE dataset show that our method outperforms the state of the art by a large margin.
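The bi-directional "global queries local" idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the global feature is pooled or how matching is scored, so the mean pooling, dot-product matching, sigmoid, and averaging of the two directions below are all assumptions, and the function name `dual_matching_scores` is hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dual_matching_scores(audio, visual):
    """Score each segment by cross-modal agreement.

    audio, visual: arrays of shape (T, D) holding one feature
    vector per video segment for each modality.
    """
    # Global feature per modality: mean over all segments (assumed pooling).
    audio_global = audio.mean(axis=0)
    visual_global = visual.mean(axis=0)
    # Each global feature queries the local features of the other modality.
    audio_to_visual = sigmoid(visual @ audio_global)  # shape (T,)
    visual_to_audio = sigmoid(audio @ visual_global)  # shape (T,)
    # Combine both directions (assumed: simple average).
    return 0.5 * (audio_to_visual + visual_to_audio)


# Toy data: a shared "event" pattern appears in segments 3..6 of both modalities.
rng = np.random.default_rng(0)
T, D = 10, 8
event = rng.normal(size=D)
audio = 0.1 * rng.normal(size=(T, D))
visual = 0.1 * rng.normal(size=(T, D))
audio[3:7] += event
visual[3:7] += event

scores = dual_matching_scores(audio, visual)
```

In this toy setup, segments where both modalities carry the shared pattern receive higher scores than the background segments, which is the kind of per-segment relevance signal a localization model could threshold or train against.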
Pages: 6301-6309
Page count: 9