DIFFCRNN: A NOVEL APPROACH FOR DETECTING SOUND EVENTS IN SMART HOME SYSTEMS USING DIFFUSION-BASED CONVOLUTIONAL RECURRENT NEURAL NETWORK

Cited: 0
Authors
Al Dabel, Maryam M. [1 ]
Affiliations
[1] Univ Hafr Al Batin, Coll Comp Sci & Engn, Dept Comp Sci & Engn, Hafar Al Batin, Saudi Arabia
Source
SCALABLE COMPUTING-PRACTICE AND EXPERIENCE | 2024, Vol. 25, No. 05
Keywords
Sound event detection; latent diffusion model; spectrogram; deep neural network
DOI
10.12694/scpe.v25i5.3031
CLC Number
TP31 [Computer Software]
Discipline Codes
081202; 0835
Abstract
This paper presents a latent diffusion model combined with a convolutional recurrent neural network for sound event detection, fusing the advantages of different networks to advance security applications and smart home systems. The proposed approach is first trained on extensive datasets and then adapted to the target task via transfer learning, effectively mitigating the challenge of limited data availability. A latent diffusion model compresses the mel-spectrogram of the audio into a discrete latent representation. A convolutional neural network (CNN) then serves as the front-end of a recurrent neural network (RNN) and produces a feature map, from which an attention module predicts attention maps along the temporal and spectral dimensions. The input spectrogram is multiplied by the generated attention maps for adaptive feature refinement, and trainable scalar weights aggregate the refined features from the back-end RNN. Experimental results show that the proposed method outperforms the state of the art on three datasets: DCASE2016-SED, DCASE2017-SED, and URBAN-SED. On DCASE2016-SED, the approach peaks at an F1 of 66.2% and an ER of 0.42; on DCASE2017-SED, it achieves an F1 of 68.1% and an ER of 0.40; and on URBAN-SED, it significantly outperforms existing alternatives with an F1 of 74.3% and an ER of 0.44.
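The attention-based refinement step described in the abstract (attention maps predicted over the temporal and spectral dimensions, multiplied elementwise into the spectrogram, then combined by scalar weights) can be illustrated with a toy NumPy sketch. The random projection standing in for the CNN front-end, the mean-pooled attention vectors, and the fixed scalars `alpha`/`beta` are all illustrative assumptions, not the paper's actual trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "mel-spectrogram": time_frames x mel_bins
T, F = 100, 64
spec = rng.random((T, F))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in for the CNN front-end: a fixed random projection
# producing a feature map of the same shape as the spectrogram.
W_feat = rng.standard_normal((F, F)) * 0.1
feature_map = np.tanh(spec @ W_feat)  # (T, F)

# Illustrative attention module: pool the feature map along each
# axis and squash to (0, 1), giving temporal and spectral weights.
temporal_att = sigmoid(feature_map.mean(axis=1, keepdims=True))  # (T, 1)
spectral_att = sigmoid(feature_map.mean(axis=0, keepdims=True))  # (1, F)
attention_map = temporal_att * spectral_att  # (T, F) via broadcasting

# Adaptive feature refinement: elementwise product with the input.
refined = spec * attention_map

# Stand-in for the trainable scalar weights that aggregate features
# (fixed here; in the paper these scalars are learned).
alpha, beta = 0.7, 0.3
aggregated = alpha * refined + beta * spec

print(aggregated.shape)  # (100, 64)
```

In the actual model the attention maps would be produced by learned layers and the aggregation scalars updated by backpropagation; the sketch only shows how the shapes and the multiplicative refinement fit together.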
Pages: 3796-3811
Page count: 16