CPM: Class-Conditional Prompting Machine for Audio-Visual Segmentation

Cited by: 0

Authors:

Chen, Yuanhong [1]

Wang, Chong [1]

Liu, Yuyuan [1]

Wang, Hu [2]

Carneiro, Gustavo [3]

Affiliations:

[1] Univ Adelaide, Australian Inst Machine Learning, Adelaide, SA, Australia

[2] Mohamed bin Zayed Univ Artificial Intelligence, Abu Dhabi, U Arab Emirates

[3] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford, Surrey, England

Source:

COMPUTER VISION - ECCV 2024, PT X | 2025 / Vol. 15068

Keywords:

Audio-visual Learning; Segmentation; Multi-modal Learning

DOI:

10.1007/978-3-031-72684-2_25

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Audio-visual segmentation (AVS) is an emerging task that aims to accurately segment sounding objects based on audio-visual cues. The success of AVS learning systems depends on the effectiveness of cross-modal interaction. Such a requirement can be naturally fulfilled by leveraging a transformer-based segmentation architecture, owing to its inherent ability to capture long-range dependencies and its flexibility in handling different modalities. However, the inherent training issues of transformer-based methods, such as the low efficacy of cross-attention and unstable bipartite matching, can be amplified in AVS, particularly when the learned audio query does not provide a clear semantic clue. In this paper, we address these two issues with the new Class-conditional Prompting Machine (CPM). CPM improves the bipartite matching with a learning strategy that combines class-agnostic queries with class-conditional queries. The efficacy of cross-modal attention is improved with new learning objectives for the audio, visual and joint modalities. We conduct experiments on AVS benchmarks, demonstrating that our method achieves state-of-the-art (SOTA) segmentation accuracy.


Pages: 438-456

Page count: 19

