Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

Cited by: 6
Authors
Geng, Tiantian [1,2]
Wang, Teng [1,3]
Duan, Jinming [2]
Cong, Runmin [4]
Zheng, Feng [1,5]
Affiliations
[1] Southern Univ Sci & Technol, Shenzhen, Peoples R China
[2] Univ Birmingham, Birmingham, England
[3] Univ Hong Kong, Hong Kong, Peoples R China
[4] Shandong Univ, Jinan, Peoples R China
[5] Peng Cheng Lab, Shenzhen, Peoples R China
Source
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2023
Funding
National Natural Science Foundation of China; National Key R&D Program of China
DOI
10.1109/CVPR52729.2023.02197
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Existing audio-visual event localization (AVE) handles manually trimmed videos, each containing only a single event instance. However, this setting is unrealistic, as natural videos often contain numerous audio-visual events of different categories. To better adapt to real-life applications, in this paper we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. The problem is challenging as it requires fine-grained audio-visual scene and context understanding. To tackle this problem, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and might co-occur as in real-life scenes. Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events of various lengths and capture dependencies between them in a single pass. Extensive experiments demonstrate the effectiveness of our method as well as the significance of multi-scale cross-modal perception and dependency modeling for this task. The dataset and code are available at https://unav100.github.io.
Pages: 22942-22951 (10 pages)
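
As a rough illustration of what the abstract's "multi-scale cross-modal perception" could look like, the PyTorch sketch below pairs two cross-attention blocks (each modality attending to the other), a strided convolution that builds a temporal feature pyramid, and shared heads predicting per-timestep class scores and start/end offsets at every scale. All module names, dimensions, and design choices here are assumptions made for illustration; this is not the paper's actual architecture.

# Illustrative sketch only; every design choice below is an assumption,
# not the method described in the paper.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """Fuse one modality with the other via cross-attention (hypothetical)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # The query modality attends to the other one; the residual
        # connection preserves its unimodal cues.
        fused, _ = self.attn(query, context, context)
        return self.norm(query + fused)


class MultiScaleAVLocalizer(nn.Module):
    """Toy dense localizer: fuse audio and visual streams, build a temporal
    pyramid, and predict class scores plus start/end offsets per timestep."""

    def __init__(self, dim: int = 256, num_classes: int = 100, num_scales: int = 3):
        super().__init__()
        self.a2v = CrossModalBlock(dim)  # visual queries attend to audio
        self.v2a = CrossModalBlock(dim)  # audio queries attend to visual
        self.downsample = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
        self.cls_head = nn.Linear(dim, num_classes)  # event category per timestep
        self.reg_head = nn.Linear(dim, 2)            # distances to event start/end
        self.num_scales = num_scales

    def forward(self, visual: torch.Tensor, audio: torch.Tensor):
        # visual, audio: (batch, time, dim) snippet-level features.
        v = self.a2v(visual, audio)
        a = self.v2a(audio, visual)
        x = v + a  # simple additive fusion (an assumption)
        cls_out, reg_out = [], []
        for _ in range(self.num_scales):
            cls_out.append(self.cls_head(x))          # (B, T_s, num_classes)
            reg_out.append(self.reg_head(x).relu())   # (B, T_s, 2), non-negative
            # Halve the temporal resolution so longer events can be
            # matched at coarser pyramid levels.
            x = self.downsample(x.transpose(1, 2)).transpose(1, 2)
        return cls_out, reg_out


if __name__ == "__main__":
    model = MultiScaleAVLocalizer()
    v = torch.randn(2, 128, 256)  # e.g. visual snippet features
    a = torch.randn(2, 128, 256)  # e.g. audio snippet features
    cls_out, reg_out = model(v, a)
    print([tuple(c.shape) for c in cls_out])  # [(2,128,100), (2,64,100), (2,32,100)]

Coarser pyramid levels see the sequence at lower temporal resolution, which is the usual way single-pass detectors cover instances of widely varying lengths: short events are matched at fine scales, long ones at coarse scales.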