Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

Cited by: 6
Authors
Geng, Tiantian [1,2]
Wang, Teng [1,3]
Duan, Jinming [2]
Cong, Runmin [4]
Zheng, Feng [1,5]
Affiliations
[1] Southern Univ Sci & Technol, Shenzhen, Peoples R China
[2] Univ Birmingham, Birmingham, England
[3] Univ Hong Kong, Hong Kong, Peoples R China
[4] Shandong Univ, Jinan, Peoples R China
[5] Peng Cheng Lab, Shenzhen, Peoples R China
Source
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2023
Funding
National Natural Science Foundation of China; National Key R&D Program of China
DOI
10.1109/CVPR52729.2023.02197
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Existing audio-visual event localization (AVE) handles manually trimmed videos, each containing only a single event instance. However, this setting is unrealistic, as natural videos often contain numerous audio-visual events of different categories. To better adapt to real-life applications, in this paper we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. The problem is challenging as it requires fine-grained audio-visual scene and context understanding. To tackle this problem, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and might co-occur as in real-life scenes. Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events of various lengths and capture dependencies between them in a single pass. Extensive experiments demonstrate the effectiveness of our method as well as the significance of multi-scale cross-modal perception and dependency modeling for this task. The dataset and code are available at https://unav100.github.io.
Pages: 22942-22951 (10 pages)
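
As a rough illustration of what the abstract's "multi-scale cross-modal perception" could look like, the PyTorch sketch below pairs two cross-attention blocks (each modality attending to the other), a strided convolution that builds a temporal feature pyramid, and shared heads predicting per-timestep class scores and start/end offsets at every scale. All module names, dimensions, and design choices here are assumptions made for illustration; this is not the paper's actual architecture.

# Illustrative sketch only; every design choice below is an assumption,
# not the method described in the paper.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """Fuse one modality with the other via cross-attention (hypothetical)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # The query modality attends to the other one; the residual
        # connection preserves its unimodal cues.
        fused, _ = self.attn(query, context, context)
        return self.norm(query + fused)


class MultiScaleAVLocalizer(nn.Module):
    """Toy dense localizer: fuse audio and visual streams, build a temporal
    pyramid, and predict class scores plus start/end offsets per timestep."""

    def __init__(self, dim: int = 256, num_classes: int = 100, num_scales: int = 3):
        super().__init__()
        self.a2v = CrossModalBlock(dim)  # visual queries attend to audio
        self.v2a = CrossModalBlock(dim)  # audio queries attend to visual
        self.downsample = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
        self.cls_head = nn.Linear(dim, num_classes)  # event category per timestep
        self.reg_head = nn.Linear(dim, 2)            # distances to event start/end
        self.num_scales = num_scales

    def forward(self, visual: torch.Tensor, audio: torch.Tensor):
        # visual, audio: (batch, time, dim) snippet-level features.
        v = self.a2v(visual, audio)
        a = self.v2a(audio, visual)
        x = v + a  # simple additive fusion (an assumption)
        cls_out, reg_out = [], []
        for _ in range(self.num_scales):
            cls_out.append(self.cls_head(x))          # (B, T_s, num_classes)
            reg_out.append(self.reg_head(x).relu())   # (B, T_s, 2), non-negative
            # Halve the temporal resolution so longer events can be
            # matched at coarser pyramid levels.
            x = self.downsample(x.transpose(1, 2)).transpose(1, 2)
        return cls_out, reg_out


if __name__ == "__main__":
    model = MultiScaleAVLocalizer()
    v = torch.randn(2, 128, 256)  # e.g. visual snippet features
    a = torch.randn(2, 128, 256)  # e.g. audio snippet features
    cls_out, reg_out = model(v, a)
    print([tuple(c.shape) for c in cls_out])  # [(2,128,100), (2,64,100), (2,32,100)]

Coarser pyramid levels see the sequence at lower temporal resolution, which is the usual way single-pass detectors cover instances of widely varying lengths: short events are matched at fine scales, long ones at coarse scales.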