Deep Scalogram Representations for Acoustic Scene Classification

Cited by: 82
Authors
Ren, Zhao [1]
Qian, Kun [1,2]
Zhang, Zixing [3]
Pandit, Vedhas [1]
Baird, Alice [1]
Schuller, Bjoern [1,3]
Affiliations
[1] University of Augsburg, ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, Augsburg, Germany
[2] Technical University of Munich, Machine Intelligence & Signal Processing Group, Munich, Germany
[3] Imperial College London, GLAM, London, England
Keywords
Acoustic scene classification (ASC); (bidirectional) gated recurrent neural networks ((B)GRNNs); convolutional neural networks (CNNs); deep scalogram representation; spectrogram representation; SOUND CLASSIFICATION; EMOTION
DOI
10.1109/JAS.2018.7511066
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Spectrogram representations of acoustic scenes have achieved competitive performance in acoustic scene classification, yet a spectrogram alone fails to capture a substantial amount of time-frequency information. In this study, we present an approach that explores the benefits of deep scalogram representations extracted in segments from an audio stream. The proposed approach first transforms the segmented acoustic scenes into bump and Morse scalograms, as well as spectrograms; second, the spectrograms or scalograms are fed into pre-trained convolutional neural networks; third, the features extracted from a subsequent fully connected layer are fed into (bidirectional) gated recurrent neural networks, followed by a single highway layer and a softmax layer; finally, the predictions of these three systems are fused by a margin sampling value strategy. We evaluate the proposed approach on the acoustic scene classification data set of the 2017 IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). On the evaluation set, fusing the spectrogram with the bump scalogram yields an accuracy of 64.0% with bidirectional gated recurrent neural networks, an improvement over the 61.0% baseline provided by the DCASE 2017 organisers. This result shows that the extracted bump scalograms can improve classification accuracy when fused with a spectrogram-based system.
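
As a concrete illustration of the fusion step in the abstract, the following Python sketch shows one common reading of margin-sampling-value (MSV) late fusion: for each test instance, the subsystem whose margin between its two highest class posteriors is largest supplies the final label. The function name msv_fuse, the array shapes, and the toy data are illustrative assumptions, not taken from the paper.

import numpy as np

def msv_fuse(posteriors):
    # posteriors: list of arrays, one per subsystem, each of shape
    # (n_instances, n_classes), holding softmax class posteriors.
    # For each instance, the subsystem with the largest top-two margin
    # (i.e., the most confident one) provides the final label.
    # This is a hedged reading of the MSV strategy, not necessarily
    # the authors' exact implementation.
    stacked = np.stack(posteriors)              # (n_systems, n_instances, n_classes)
    top2 = np.sort(stacked, axis=-1)[..., -2:]  # two largest posteriors per system/instance
    margins = top2[..., 1] - top2[..., 0]       # MSV per system and instance
    best_system = margins.argmax(axis=0)        # most confident system per instance
    labels = stacked.argmax(axis=-1)            # per-system predicted labels
    return labels[best_system, np.arange(stacked.shape[1])]

# Toy usage with three hypothetical subsystems (spectrogram, bump, Morse),
# 4 instances and 15 scene classes:
rng = np.random.default_rng(0)
fake = [rng.dirichlet(np.ones(15), size=4) for _ in range(3)]
print(msv_fuse(fake))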
Pages: 662-669
Number of pages: 8