Audio-Visual Event Localization by Learning Spatial and Semantic Co-Attention

Cited by: 18
Authors
Xue, Cheng [1]
Zhong, Xionghu [1]
Cai, Minjie [1]
Chen, Hao [1]
Wang, Wenwu [2]
Affiliations
[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China
[2] Univ Surrey, Ctr Vis Speech & Signal Proc, Dept Elect & Elect Engn, Guildford GU2 7XH, England
Funding
National Natural Science Foundation of China
Keywords
Visualization; Location awareness; Task analysis; Semantics; Feature extraction; Correlation; Automobiles; Audio-visual; event localization; cross-modal; co-attention; deep learning;
DOI
10.1109/TMM.2021.3127029
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This work aims to temporally localize events that are both audible and visible in video. Previous methods mainly focused on temporal modeling of events with simple fusion of audio and visual features. In natural scenes, a video records not only the events of interest but also ambient acoustic noise and visual background, resulting in redundant information in the raw audio and visual features. Thus, direct fusion of the two features often causes false localization of the events. In this paper, we propose a co-attention model to exploit the spatial and semantic correlations between the audio and visual features, which helps guide the extraction of discriminative features for better event localization. Our assumption is that in an audio-visual event, shared semantic information between audio and visual features exists and can be extracted by attention learning. Specifically, the proposed co-attention model is composed of a co-spatial attention module and a co-semantic attention module that are used to model the spatial and semantic correlations, respectively. The proposed co-attention model can be applied to various event localization tasks, such as cross-modality localization and multimodal event localization. Experiments on the public audio-visual event (AVE) dataset demonstrate that the proposed method achieves state-of-the-art performance by learning spatial and semantic co-attention.
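The abstract names two components, a co-spatial attention module and a co-semantic attention module, but does not give their equations. The following is only a minimal sketch of the general idea, assuming scaled dot-product scoring for the spatial step and a sigmoid channel gate for the semantic step; both choices, and the names `audio_guided_spatial_attention` and `semantic_gate`, are hypothetical illustrations and not the formulation used in the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def audio_guided_spatial_attention(regions, audio):
    """Weight visual regions by their similarity to the audio feature.

    regions: list of R visual region vectors, each of length D
    audio:   audio feature vector of length D (assumed same dimension)
    Returns the attention-pooled visual vector and the R region weights.
    """
    scale = math.sqrt(len(audio))  # assumed scaled dot-product scoring
    weights = softmax([dot(r, audio) / scale for r in regions])
    attended = [sum(w * r[d] for w, r in zip(weights, regions))
                for d in range(len(audio))]
    return attended, weights

def semantic_gate(visual, audio):
    """Hypothetical channel gate: keep channels where both modalities agree."""
    gate = [1.0 / (1.0 + math.exp(-(v * a))) for v, a in zip(visual, audio)]
    gated_visual = [g * v for g, v in zip(gate, visual)]
    gated_audio = [g * a for g, a in zip(gate, audio)]
    return gated_visual, gated_audio

# Toy example: two regions, the first aligned with the audio feature.
regions = [[1.0, 0.0], [0.0, 1.0]]
audio = [10.0, 0.0]
attended, weights = audio_guided_spatial_attention(regions, audio)
gated_visual, gated_audio = semantic_gate(attended, audio)
```

In this toy case the first region receives the larger weight because it points in the same direction as the audio vector, which is the behaviour an audio-guided spatial attention step is meant to produce: regions unrelated to the sound are suppressed before the two modalities are fused.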
Pages: 418-429
Number of pages: 12