Weakly-Supervised Audio-Visual Segmentation

Cited by: 0
Authors
Mo, Shentong [1, 2]
Raj, Bhiksha [1, 2]
Affiliations
[1] CMU, Pittsburgh, PA 15213 USA
[2] MBZUAI, Abu Dhabi, United Arab Emirates
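Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract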
Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work applied a comprehensive manually designed architecture with countless pixel-wise accurate masks as supervision. However, these pixel-level masks are expensive and not available in all cases. In this work, we aim to simplify the supervision as the instance-level annotation, i.e., weakly-supervised audio-visual segmentation. We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that can learn multi-scale audio-visual alignment with multi-scale multiple-instance contrastive learning for audio-visual segmentation. Extensive experiments on AVS-Bench demonstrate the effectiveness of our WS-AVS in the weakly-supervised audio-visual segmentation of single-source and multi-source scenarios.
Pages: 14