CPM: Class-Conditional Prompting Machine for Audio-Visual Segmentation

Citations: 0
Authors
Chen, Yuanhong [1 ]
Wang, Chong [1 ]
Liu, Yuyuan [1 ]
Wang, Hu [2 ]
Carneiro, Gustavo [3 ]
Affiliations
[1] Univ Adelaide, Australian Inst Machine Learning, Adelaide, SA, Australia
[2] Mohamed bin Zayed Univ Artificial Intelligence, Abu Dhabi, U Arab Emirates
[3] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford, Surrey, England
Source
COMPUTER VISION - ECCV 2024, PT X | 2025 / Vol. 15068
Keywords
Audio-visual Learning; Segmentation; Multi-modal Learning;
DOI
10.1007/978-3-031-72684-2_25
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Audio-visual segmentation (AVS) is an emerging task that aims to accurately segment sounding objects based on audio-visual cues. The success of AVS learning systems depends on the effectiveness of cross-modal interaction. Such a requirement can be naturally fulfilled by leveraging a transformer-based segmentation architecture, owing to its inherent ability to capture long-range dependencies and its flexibility in handling different modalities. However, the inherent training issues of transformer-based methods, such as the low efficacy of cross-attention and unstable bipartite matching, can be amplified in AVS, particularly when the learned audio query does not provide a clear semantic clue. In this paper, we address these two issues with the new Class-conditional Prompting Machine (CPM). CPM improves the bipartite matching with a learning strategy combining class-agnostic queries with class-conditional queries. The efficacy of cross-modal attention is upgraded with new learning objectives for the audio, visual and joint modalities. We conduct experiments on AVS benchmarks, demonstrating that our method achieves state-of-the-art (SOTA) segmentation accuracy.
Pages: 438-456
Number of pages: 19