AVSegFormer: Audio-Visual Segmentation with Transformer

被引:0
作者
Gao, Shengyi [1 ]
Chen, Zhe [1 ]
Chen, Guo [1 ]
Wang, Wenhai [2 ]
Lu, Tong [1 ]
机构
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
来源
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 11 | 2024年
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Audio-visual segmentation (AVS) aims to locate and segment the sounding objects in a given video, which demands audio-driven pixel-level scene understanding. The existing methods cannot fully process the fine-grained correlations between audio and visual cues across various situations dynamically. They also face challenges in adapting to complex scenarios, such as evolving audio, the coexistence of multiple objects, and more. In this paper, we propose AVSegFormer, a novel framework for AVS that leverages the transformer architecture. Specifically, It comprises a dense audio-visual mixer, which can dynamically adjust interested visual features, and a sparse audio-visual decoder, which implicitly separates audio sources and automatically matches optimal visual features. Combining both components provides a more robust bidirectional conditional multi-modal representation, improving the segmentation performance in different scenarios. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.
引用
收藏
页码:12155 / 12163
页数:9
相关论文
共 46 条
[1]   VQA: Visual Question Answering [J].
Antol, Stanislaw ;
Agrawal, Aishwarya ;
Lu, Jiasen ;
Mitchell, Margaret ;
Batra, Dhruv ;
Zitnick, C. Lawrence ;
Parikh, Devi .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :2425-2433
[2]   Objects that Sound [J].
Arandjelovic, Relja ;
Zisserman, Andrew .
COMPUTER VISION - ECCV 2018, PT I, 2018, 11205 :451-466
[3]   Look, Listen and Learn [J].
Arandjelovic, Relja ;
Zisserman, Andrew .
2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, :609-617
[4]   End-to-End Object Detection with Transformers [J].
Carion, Nicolas ;
Massa, Francisco ;
Synnaeve, Gabriel ;
Usunier, Nicolas ;
Kirillov, Alexander ;
Zagoruyko, Sergey .
COMPUTER VISION - ECCV 2020, PT I, 2020, 12346 :213-229
[5]   Localizing Visual Sounds the Hard Way [J].
Chen, Honglie ;
Xie, Weidi ;
Afouras, Triantafyllos ;
Nagrani, Arsha ;
Vedaldi, Andrea ;
Zisserman, Andrew .
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, :16862-16871
[6]   Hybrid Task Cascade for Instance Segmentation [J].
Chen, Kai ;
Pang, Jiangmiao ;
Wang, Jiaqi ;
Xiong, Yu ;
Li, Xiaoxiao ;
Sun, Shuyang ;
Feng, Wansen ;
Liu, Ziwei ;
Shi, Jianping ;
Ouyang, Wanli ;
Loy, Chen Change ;
Lin, Dahua .
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, :4969-4978
[7]  
Chen Z., 2022, INT C LEARN REPR
[8]   Masked-attention Mask Transformer for Universal Image Segmentation [J].
Cheng, Bowen ;
Misra, Ishan ;
Schwing, Alexander G. ;
Kirillov, Alexander ;
Girdhar, Rohit .
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, :1280-1289
[9]  
Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848
[10]   TransVG: End-to-End Visual Grounding with Transformers [J].
Deng, Jiajun ;
Yang, Zhengyuan ;
Chen, Tianlang ;
Zhou, Wengang ;
Li, Houqiang .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :1749-1759