DHHN: Dual Hierarchical Hybrid Network for Weakly-Supervised Audio-Visual Video Parsing

Cited by: 18
Authors
Jiang, Xun [1]
Xu, Xing [1]
Chen, Zhiguo [1]
Zhang, Jingran [1]
Song, Jingkuan [1]
Shen, Fumin [1]
Lu, Huimin [2]
Shen, Heng Tao [1, 3]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Peoples R China
[2] Kyushu Inst Technol, Kitakyushu, Fukuoka, Japan
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Keywords
Multimodality; Weakly-supervised Learning; Video Understanding; Audio-Visual Comprehension;
DOI
10.1145/3503161.3548309
Chinese Library Classification (CLC) number
TP39 [Computer Applications];
Discipline classification codes
081203 ; 0835 ;
Abstract
The Weakly-Supervised Audio-Visual Video Parsing (AVVP) task aims to parse a video into temporal segments and predict their event categories per modality, labeling each event as audible, visible, or both. Because temporal boundaries and modality annotations are not provided and only video-level event labels are available, this task is more challenging than conventional video understanding tasks. Most previous works analyze videos by jointly modeling the audio and visual data and then learning from segment-level features of fixed length. However, such a design has two defects: 1) the varied semantic information carried by different temporal lengths is neglected, which may lead models to learn incorrect information; 2) because of the joint context modeling, the unique features of each modality are not fully explored. In this paper, we propose a novel AVVP framework termed Dual Hierarchical Hybrid Network (DHHN) to tackle these two problems. Our DHHN method consists of three components: 1) a hierarchical context modeling network for extracting different semantics at multiple temporal lengths; 2) a modality-wise guiding network for learning unique information from different modalities; 3) a dual-stream framework that generates audio and visual predictions separately. This design maintains the best adaptation to each modality, further boosting video parsing performance. Extensive quantitative and qualitative experiments demonstrate that our proposed method establishes new state-of-the-art performance on the AVVP task.
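As a rough illustration of what "multiple temporal lengths" and "dual-stream" can mean in practice, the sketch below pools segment-level features over several window sizes, separately for the audio and visual streams. It is not the DHHN implementation: plain sliding-window mean pooling is swapped in for the paper's learned hierarchical context modeling, and the function names, window sizes, and toy feature values are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def temporal_mean_pool(features: Sequence[Sequence[float]], window: int) -> List[Vector]:
    """Average every run of `window` consecutive segment features (stride 1).

    Assumption: mean pooling stands in for a learned context model.
    """
    if window < 1 or window > len(features):
        raise ValueError("window must be between 1 and the number of segments")
    dim = len(features[0])
    pooled: List[Vector] = []
    for start in range(len(features) - window + 1):
        chunk = features[start:start + window]
        pooled.append([sum(seg[d] for seg in chunk) / window for d in range(dim)])
    return pooled


def multi_scale_contexts(
    features: Sequence[Sequence[float]], windows: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[Vector]]:
    """Return one pooled sequence per temporal length (hypothetical window sizes)."""
    return {w: temporal_mean_pool(features, w) for w in windows}


# Toy 4-segment features; each stream is processed on its own (dual-stream).
audio = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
visual = [[1.0, 1.0], [1.0, 1.0], [3.0, 3.0], [3.0, 3.0]]

audio_ctx = multi_scale_contexts(audio)
visual_ctx = multi_scale_contexts(visual)
```

Each stream yields one feature sequence per window size, so longer windows summarize coarser events while a window of 1 keeps the original segment-level detail.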
Pages: 9