Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser

Authors
Lai, Yung-Hsuan [1]
Chen, Yen-Chun [2]
Wang, Yu-Chiang Frank [1, 3]
Affiliations
[1] Natl Taiwan Univ, Taipei, Taiwan
[2] Microsoft, Redmond, WA USA
[3] NVIDIA, Santa Clara, CA USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Keywords
NETWORK;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Audio-visual learning has been a major pillar of multi-modal machine learning, where the community has mostly focused on its modality-aligned setting, i.e., the audio and visual modalities are both assumed to signal the prediction target. With the Look, Listen, and Parse dataset (LLP), we investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video with only weak labels observed. Such weak video-level labels only tell which events happen, without indicating the modality in which they are perceived (audio, visual, or both). To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers. A simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), is proposed to harvest modality labels for the training events. Empirical studies show that the harvested labels significantly improve an attentional baseline by 8.0 in average F-score (Type@AV). Surprisingly, we found that modality-independent teachers outperform their modality-fused counterparts since they are immune to noise from the other, potentially unaligned modality. Moreover, our best model achieves the new state-of-the-art on all metrics of LLP by a substantial margin (+5.4 F-score for Type@AV). VALOR is further generalized to Audio-Visual Event Localization and achieves the new state-of-the-art as well.
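The label-harvesting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `harvest_modality_labels`, the per-segment score dictionaries, and the single fixed threshold are all assumptions made for illustration. The sketch only shows the general pattern of keeping a weak video-level event for one modality when a teacher for that modality, looking at that modality alone, scores it above a threshold.

```python
from typing import Dict, List, Set


def harvest_modality_labels(
    weak_labels: Set[str],
    teacher_scores: List[Dict[str, float]],
    threshold: float,
) -> List[Set[str]]:
    """Hypothetical helper: derive per-segment labels for ONE modality.

    weak_labels    -- events known to occur somewhere in the video
    teacher_scores -- one dict per segment, mapping event name to the
                      score given by that modality's teacher (assumed)
    threshold      -- assumed cut-off; an event is kept if score > threshold
    """
    harvested = []
    for scores in teacher_scores:
        # Only events in the weak video-level label are candidates, so a
        # confident teacher score for an unlabeled event is ignored.
        kept = {e for e in weak_labels if scores.get(e, 0.0) > threshold}
        harvested.append(kept)
    return harvested


# Toy example with made-up scores for a two-segment video.
weak = {"dog", "speech"}
audio_scores = [
    {"dog": 0.9, "speech": 0.2, "car": 0.95},  # "car" is not a weak label
    {"dog": 0.1, "speech": 0.8},
]
visual_scores = [
    {"dog": 0.7, "speech": 0.1},
    {"dog": 0.6, "speech": 0.3},
]

# Each modality is processed independently of the other.
audio_labels = harvest_modality_labels(weak, audio_scores, threshold=0.5)
visual_labels = harvest_modality_labels(weak, visual_scores, threshold=0.5)
```

In this toy input, "speech" is kept only in the audio stream and never in the visual one, which is the kind of modality distinction the weak video-level label cannot express by itself.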
Pages: 19