Audio-Visual Event Localization by Learning Spatial and Semantic Co-Attention

Cited by: 18

Authors:

Xue, Cheng [1]

Zhong, Xionghu [1]

Cai, Minjie [1]

Chen, Hao [1]

Wang, Wenwu [2]

Affiliations:

[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China

[2] Univ Surrey, Ctr Vis Speech & Signal Proc, Dept Elect & Elect Engn, Guildford GU2 7XH, England

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2023 / Vol. 25

Funding:

National Natural Science Foundation of China;

Keywords:

Visualization; Location awareness; Task analysis; Semantics; Feature extraction; Correlation; Automobiles; Audio-visual; event localization; cross-modal; co-attention; deep learning;

DOI:

10.1109/TMM.2021.3127029

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

This work aims to temporally localize events that are both audible and visible in video. Previous methods mainly focused on temporal modeling of events with simple fusion of audio and visual features. In natural scenes, a video records not only the events of interest but also ambient acoustic noise and visual background, resulting in redundant information in the raw audio and visual features. Thus, direct fusion of the two features often causes false localization of the events. In this paper, we propose a co-attention model to exploit the spatial and semantic correlations between the audio and visual features, which helps guide the extraction of discriminative features for better event localization. Our assumption is that in an audio-visual event, shared semantic information between audio and visual features exists and can be extracted by attention learning. Specifically, the proposed co-attention model is composed of a co-spatial attention module and a co-semantic attention module that are used to model the spatial and semantic correlations, respectively. The proposed co-attention model can be applied to various event localization tasks, such as cross-modality localization and multimodal event localization. Experiments on the public audio-visual event (AVE) dataset demonstrate that the proposed method achieves state-of-the-art performance by learning spatial and semantic co-attention.
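To make the idea of attention-guided feature extraction concrete, the following is a minimal sketch of one generic form of cross-modal attention, in which an audio feature vector weights a set of visual region features. It is not the authors' co-spatial or co-semantic module, whose details the abstract does not give. The function names (`softmax`, `audio_guided_attention`), the scaled dot-product scoring, and the toy 2-dimensional features are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_attention(audio_vec, region_feats):
    """Weight visual regions by their similarity to the audio feature.

    Hypothetical illustration only: scaled dot-product scoring is an
    assumed choice, not the scoring function used in the paper.
    """
    d = len(audio_vec)
    # Similarity between the audio vector and each visual region.
    scores = [
        sum(a * v for a, v in zip(audio_vec, region)) / math.sqrt(d)
        for region in region_feats
    ]
    weights = softmax(scores)
    # Attended visual feature: attention-weighted sum of the regions.
    attended = [
        sum(w * region[i] for w, region in zip(weights, region_feats))
        for i in range(d)
    ]
    return weights, attended


# Toy example: three visual regions, 2-dimensional features.
audio = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
weights, attended = audio_guided_attention(audio, regions)
```

In this toy setup the region whose feature aligns with the audio vector receives the largest weight, so background regions contribute less to the attended visual feature. A co-attention design would presumably also apply guidance in the opposite direction, from visual to audio features; that half is omitted here.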

Pages: 418-429

Page count: 12
