ResCapsnet: a capsule network with CRAM and BiGRU for sound event detection

Times cited: 0

Authors:

Sun, Bing [1]

Liu, Chenglong [1]

Yang, Shuguo [1]

Wang, Wenwu [2]

Mei, Yiduo [3]

Affiliations:

[1] Qingdao Univ Sci & Technol, Coll Math & Phys, Qingdao 266061, Peoples R China

[2] Univ Surrey, Ctr Vis Speech & Signal Proc CVSSP, Sch Comp Sci & Elect Engn, Guildford GU2 7XH, England

[3] Inspur Yunzhou Ind Internet Co Ltd, Jinan 250101, Peoples R China

Source:

EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING | 2025, Vol. 2025, Issue 01

Keywords:

Sound event detection; Capsule network; CRAM; BiGRU; Dynamic routing; NEURAL-NETWORKS; CLASSIFICATION; LOCALIZATION; RECOGNITION;

DOI:

10.1186/s13636-025-00409-2

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification code:

070206 ; 082403 ;

Abstract:

Sound event detection (SED) is a challenging task in which ambient sound events are detected in a given audio signal; it involves both categorizing the events and estimating their onset and offset times. Deep learning methods such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have achieved promising performance in SED. However, for overlapping sound events, existing deep learning methods are still limited in detecting individual sound events from their mixtures. Inspired by the success of the dynamic routing mechanism of the capsule network (CapsNet), this paper proposes a capsule network model (ResCapsnet-BiGRU) based on a customized residual attention module (CRAM) and a bidirectional gated recurrent unit (BiGRU). CRAM is used to extract features relevant to sound events from log-mel spectrograms. Through dynamic routing, the capsule network can address the problem of overlapping sound events. In addition, a BiGRU with time-distributed fully connected layers is adopted to capture contextual information. The proposed method was evaluated on two datasets: the Vehicle Weakly Labeled Sound Dataset (VWLSD, DCASE 2017 Task 4) and the Domestic Environment Sound Dataset (DESD, DCASE 2022 Task 4). It achieved F-scores of 62.1% and 75.9% on the audio tagging (AT) task, and 54.1% and 59.0% on the SED task, respectively. The source code is available at https://github.com/123sunbing/ResCapsnet.git.
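The abstract names dynamic routing as the mechanism that lets the capsule layer separate overlapping events but does not spell it out. As a rough illustration only, the sketch below follows the generic routing-by-agreement procedure from the original CapsNet literature, written in plain Python. It is an assumption that the paper uses this variant; the function names (`squash`, `softmax`, `dynamic_routing`), the three routing iterations, and the toy prediction vectors are all hypothetical and are not taken from the ResCapsnet code.

```python
import math

def squash(s):
    """Scale vector s so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, num_iters=3):
    """Generic routing-by-agreement (assumed variant, not the paper's code).

    u_hat[i][j] is the prediction vector that input capsule i makes for
    output capsule j. Returns one output vector per output capsule.
    """
    num_in, num_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_out for _ in range(num_in)]  # routing logits
    v = []
    for _ in range(num_iters):
        # Each input capsule distributes its vote across output capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(num_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Raise the logit where a prediction agrees with the current output.
        for i in range(num_in):
            for j in range(num_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v

# Toy input: three input capsules, two output capsules, 2-D vectors.
# Predictions for output capsule 0 agree strongly; those for capsule 1 are weak.
u_hat = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[0.9, 0.1], [0.1, 0.0]],
    [[1.0, 0.1], [0.0, -0.1]],
]
v = dynamic_routing(u_hat)
lengths = [math.sqrt(sum(x * x for x in vec)) for vec in v]
```

In this formulation the length of each output vector acts as a presence score for the corresponding class, so an output capsule whose incoming predictions agree ends up with a longer vector than one whose predictions are weak or inconsistent.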


Pages: 14
