Self-Supervised Fine-Grained Cycle-Separation Network (FSCN) for Visual-Audio Separation

Cited by: 2
Authors
Ji, Yanli [1 ]
Ma, Shuo [1 ]
Xu, Xing [1 ]
Li, Xuelong [2 ]
Shen, Heng Tao [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Peoples R China
[2] Northwestern Polytech Univ, Xian 710072, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518000, Peoples R China
Keywords
Audio source separation; Fine-grained Cycle-Separation Network (FCSN); Self-supervised learning; Visual-guided separation
DOI
10.1109/TMM.2022.3200282
CLC number
TP [Automation technology; computer technology]
Subject classification code
0812
Abstract
Audio mixture separation remains challenging due to heavy overlap and interaction among sources. To separate audio mixtures correctly, we propose a novel self-supervised Fine-grained Cycle-Separation Network (FCSN) for vision-guided audio mixture separation. The proposed approach performs self-supervised separation in a two-stage procedure. Using visual information as guidance, a primary-stage separation is performed by a U-Net network, and the residual spectrogram is then computed by removing the separated spectrograms from the original mixture. In the second stage, a cycle-separation module refines the separation using the separated results and the residual spectrogram. Self-supervised learning between the visual and audio modalities drives the cycle separation until the residual spectrogram becomes empty. Extensive experiments are conducted on three large-scale datasets: MUSIC (MUSIC-21), AudioSet, and VGGSound. The results show that our approach outperforms state-of-the-art approaches and demonstrate its effectiveness in separating audio mixtures with overlap and interaction.
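The abstract describes a two-stage, residual-driven separation loop: visually guided primary separation, computation of the residual spectrogram, and iterative cycle-separation until the residual is (nearly) empty. The sketch below illustrates only that control flow; it is not the authors' implementation. The `separator` callable (standing in for the visually guided U-Net), the feature and mask shapes, and the stopping threshold `eps` and cycle count `max_cycles` are all assumptions made for illustration.

```python
import numpy as np

def cycle_separate(mixture_spec, visual_feats, separator,
                   max_cycles=5, eps=1e-3):
    """Sketch of residual-driven two-stage separation (assumed interfaces).

    mixture_spec : (F, T) magnitude spectrogram of the audio mixture.
    visual_feats : list of per-source visual feature vectors (the guidance).
    separator    : callable(spec, feat) -> ratio mask in [0, 1] of shape (F, T);
                   placeholder for the visually guided U-Net in the paper.
    """
    # Primary-stage separation: one visually guided pass per source.
    separated = [separator(mixture_spec, f) * mixture_spec for f in visual_feats]

    for _ in range(max_cycles):
        # Residual spectrogram: the part of the mixture the current
        # estimates fail to explain (clipped at zero).
        residual = np.clip(mixture_spec - sum(separated), 0.0, None)
        if residual.mean() < eps:   # residual (nearly) empty -> stop cycling
            break
        # Cycle-separation: refine each source from its current estimate
        # plus the shared residual, again under visual guidance.
        separated = [separator(s + residual, f) * (s + residual)
                     for s, f in zip(separated, visual_feats)]
    return separated
```

Under these assumptions, the loop terminates either when the residual energy falls below the threshold or after a fixed number of refinement cycles.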
Pages: 5864-5876
Number of pages: 13