Self-Supervised Fine-Grained Cycle-Separation Network (FSCN) for Visual-Audio Separation

Cited by: 2
Authors
Ji, Yanli [1 ]
Ma, Shuo [1 ]
Xu, Xing [1 ]
Li, Xuelong [2 ]
Shen, Heng Tao [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Peoples R China
[2] Northwestern Polytech Univ, Xian 710072, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518000, Peoples R China
Keywords
Audio source separation; Fine-grained Cycle-Separation Network (FCSN); Self-supervised learning; Visual-guided separation
DOI
10.1109/TMM.2022.3200282
CLC number
TP [Automation technology; computer technology]
Subject classification code
0812
Abstract
Audio mixture separation remains challenging due to heavy overlap and interaction among sources. To separate audio mixtures correctly, we propose a novel self-supervised Fine-grained Cycle-Separation Network (FCSN) for vision-guided audio mixture separation. The proposed approach performs self-supervised separation in a two-stage procedure. Using visual information as guidance, a primary-stage separation is performed by a U-Net network, and the residual spectrogram is then computed by removing the separated spectrograms from the original mixture. In the second stage, a cycle-separation module refines the separation using the separated results and the residual spectrogram. Self-supervised learning between the visual and audio modalities drives the cycle separation until the residual spectrogram becomes empty. Extensive experiments are conducted on three large-scale datasets: MUSIC (MUSIC-21), AudioSet, and VGGSound. The results show that our approach outperforms state-of-the-art approaches and demonstrate its effectiveness in separating audio mixtures with overlap and interaction.
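The abstract describes a two-stage, residual-driven separation loop: visually guided primary separation, computation of the residual spectrogram, and iterative cycle-separation until the residual is (nearly) empty. The sketch below illustrates only that control flow; it is not the authors' implementation. The `separator` callable (standing in for the visually guided U-Net), the feature and mask shapes, and the stopping threshold `eps` and cycle count `max_cycles` are all assumptions made for illustration.

```python
import numpy as np

def cycle_separate(mixture_spec, visual_feats, separator,
                   max_cycles=5, eps=1e-3):
    """Sketch of residual-driven two-stage separation (assumed interfaces).

    mixture_spec : (F, T) magnitude spectrogram of the audio mixture.
    visual_feats : list of per-source visual feature vectors (the guidance).
    separator    : callable(spec, feat) -> ratio mask in [0, 1] of shape (F, T);
                   placeholder for the visually guided U-Net in the paper.
    """
    # Primary-stage separation: one visually guided pass per source.
    separated = [separator(mixture_spec, f) * mixture_spec for f in visual_feats]

    for _ in range(max_cycles):
        # Residual spectrogram: the part of the mixture the current
        # estimates fail to explain (clipped at zero).
        residual = np.clip(mixture_spec - sum(separated), 0.0, None)
        if residual.mean() < eps:   # residual (nearly) empty -> stop cycling
            break
        # Cycle-separation: refine each source from its current estimate
        # plus the shared residual, again under visual guidance.
        separated = [separator(s + residual, f) * (s + residual)
                     for s, f in zip(separated, visual_feats)]
    return separated
```

Under these assumptions, the loop terminates either when the residual energy falls below the threshold or after a fixed number of refinement cycles.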
Pages: 5864-5876
Number of pages: 13