Audio-Visual Segmentation with Semantics

Cited by: 0
Authors
Zhou, Jinxing [1 ]
Shen, Xuyang [2 ]
Wang, Jianyuan [3 ]
Zhang, Jiayi [4 ]
Sun, Weixuan [2 ]
Zhang, Jing [5 ]
Birchfield, Stan [6 ]
Guo, Dan [1 ]
Kong, Lingpeng [7 ]
Wang, Meng [1 ]
Zhong, Yiran [2 ]
Affiliations
[1] Hefei Univ Technol, Hefei, Peoples R China
[2] Shanghai AI Lab, Shanghai, Peoples R China
[3] Univ Oxford, Oxford, England
[4] Beihang Univ, Beijing, Peoples R China
[5] Australian Natl Univ, Canberra, Australia
[6] Nvidia, Santa Clara, CA USA
[7] Univ Hong Kong, Hong Kong, Peoples R China
Keywords
Audio-visual segmentation; Multi-modal segmentation; Audio-visual learning; AVSBench; Semantic segmentation; Video segmentation; SOUND;
DOI
10.1007/s11263-024-02261-x
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (the Single-source and Multi-sources subsets) and AVSBench-semantic (the Semantic-labels subset). Accordingly, three settings are studied: (1) semi-supervised audio-visual segmentation with a single sound source; (2) fully-supervised audio-visual segmentation with multiple sound sources; and (3) fully-supervised audio-visual semantic segmentation. The first two settings require generating binary masks of sounding objects that indicate the pixels corresponding to the audio, while the third setting further requires generating semantic maps indicating the object categories. To tackle these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench dataset compare our approach with several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and the pixel-wise visual semantics. Code can be found at https://github.com/OpenNLPLab/AVSBench.
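The abstract describes a temporal pixel-wise audio-visual interaction module that injects audio semantics into the visual segmentation pathway, but the module's details are not reproduced in this record. The following is only a minimal sketch of one plausible form of such pixel-wise fusion, in which a per-frame audio embedding gates the visual feature map location by location; all class, function, and parameter names (PixelwiseAudioVisualFusion, visual_dim, audio_dim, hidden_dim) are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of pixel-wise audio-visual fusion (not the paper's code).
import torch
import torch.nn as nn


class PixelwiseAudioVisualFusion(nn.Module):
    """Gates a visual feature map with a per-frame audio embedding (hypothetical design)."""

    def __init__(self, visual_dim: int, audio_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.query_proj = nn.Conv2d(visual_dim, hidden_dim, kernel_size=1)  # visual -> shared space
        self.key_proj = nn.Linear(audio_dim, hidden_dim)                    # audio -> shared space
        self.value_proj = nn.Linear(audio_dim, hidden_dim)
        self.out_proj = nn.Conv2d(hidden_dim, visual_dim, kernel_size=1)

    def forward(self, visual_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, C_v, H, W); audio_feat: (B, C_a), one embedding per video frame.
        b, _, h, w = visual_feat.shape
        q = self.query_proj(visual_feat).flatten(2).transpose(1, 2)     # (B, HW, D)
        k = self.key_proj(audio_feat).unsqueeze(1)                      # (B, 1, D)
        v = self.value_proj(audio_feat).unsqueeze(1)                    # (B, 1, D)
        # Per-pixel relevance of the audio to each spatial location.
        sim = (q * k).sum(dim=-1, keepdim=True) / (q.shape[-1] ** 0.5)  # (B, HW, 1)
        gate = torch.sigmoid(sim)
        fused = (gate * v).transpose(1, 2).reshape(b, -1, h, w)         # (B, D, H, W)
        # Residual connection keeps the original visual semantics intact.
        return visual_feat + self.out_proj(fused)


if __name__ == "__main__":
    fusion = PixelwiseAudioVisualFusion(visual_dim=512, audio_dim=128)
    vis = torch.randn(2, 512, 28, 28)  # backbone feature map for one frame
    aud = torch.randn(2, 128)          # per-frame audio embedding (e.g., from a VGGish-like encoder)
    print(fusion(vis, aud).shape)      # torch.Size([2, 512, 28, 28])
```

The regularization loss mentioned in the abstract, which encourages the audio-visual mapping during training, would be applied on top of such fused features; its exact formulation is given in the paper and is not reproduced here.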
Pages: 1644-1664
Number of pages: 21