TimeScaleNet: A Multiresolution Approach for Raw Audio Recognition Using Learnable Biquadratic IIR Filters and Residual Networks of Depthwise-Separable One-Dimensional Atrous Convolutions

Cited by: 8
Authors
Bavu, Eric [1 ]
Ramamonjy, Aro [1 ]
Pujol, Hadrien [1 ]
Garcia, Alexandre [1 ]
Affiliations
[1] Conservatoire Natl Arts & Metiers, Lab Mecan Struct & Syst Couples, F-75003 Paris, France
Keywords
Machine hearing; audio recognition; learnable biquadratic filters; deep learning; time domain modelling; multiresolution; deep neural networks; classification; multilevel
DOI
10.1109/JSTSP.2019.2908696
Chinese Library Classification
TM [Electrical engineering]; TN [Electronic and communication technology]
Discipline Classification Codes
0808; 0809
Abstract
In this paper, we show the benefit of a multiresolution approach that encodes the relevant information contained in unprocessed time-domain acoustic signals. TimeScaleNet aims at learning an efficient representation of a sound by learning time dependencies both at the sample level and at the frame level. The proposed approach improves the interpretability of the learning scheme by unifying advanced deep learning and signal processing techniques. In particular, TimeScaleNet's architecture introduces a new form of recurrent neural layer, directly inspired by digital infinite impulse response (IIR) signal processing. This layer acts as a learnable passband biquadratic digital IIR filterbank, which builds a time-frequency-like feature map that self-adapts to the specific recognition task and dataset, with a large receptive field and very few learnable parameters. The resulting frame-level feature map is then processed by a residual network of depthwise-separable atrous convolutions. This second scale of analysis efficiently encodes relationships between the time fluctuations at the frame timescale, in the [20 ms, 200 ms] range, across the learnt pooled frequency bands. TimeScaleNet is tested on both the Speech Commands Dataset and the ESC-10 Dataset. We report a high mean accuracy of 94.87 ± 0.24% (macro-averaged F1-score: 94.9 ± 0.24%) for speech recognition, and a more moderate accuracy of 69.71 ± 1.91% (macro-averaged F1-score: 70.14 ± 1.57%) for the environmental sound classification task.
Pages: 220-235
Page count: 16
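As a closing illustration of the two timescales described in the abstract, the sketch below shows, in PyTorch, a band-pass biquadratic IIR layer with a learnable centre frequency and quality factor, followed by a residual block of depthwise-separable 1-D atrous convolutions. This is our own minimal reconstruction, not the authors' released code: the RBJ audio-EQ-cookbook parameterisation, the direct-form-I reference recursion, and every class name and hyper-parameter here are assumptions made for illustration.

import math
import torch
import torch.nn as nn


class LearnableBiquadBandpass(nn.Module):
    """One 0 dB peak-gain band-pass biquad IIR filter whose centre
    frequency and quality factor are learnable. Sketch only; the
    paper's exact parameterisation may differ."""

    def __init__(self, f0_hz: float, q: float, sample_rate: int):
        super().__init__()
        self.f0 = nn.Parameter(torch.tensor(float(f0_hz)))  # centre frequency (Hz)
        self.q = nn.Parameter(torch.tensor(float(q)))        # quality factor
        self.sample_rate = sample_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time)
        # RBJ audio-EQ-cookbook band-pass coefficients:
        # b = [alpha, 0, -alpha], a = [1 + alpha, -2 cos(w0), 1 - alpha]
        w0 = 2.0 * math.pi * self.f0 / self.sample_rate
        alpha = torch.sin(w0) / (2.0 * self.q)
        b0, b2 = alpha, -alpha
        a0, a1, a2 = 1.0 + alpha, -2.0 * torch.cos(w0), 1.0 - alpha
        ys = []  # direct-form-I recursion; slow but autograd-safe reference
        for n in range(x.shape[-1]):
            y_n = b0 * x[..., n]
            if n >= 1:
                y_n = y_n - a1 * ys[n - 1]
            if n >= 2:
                y_n = y_n + b2 * x[..., n - 2] - a2 * ys[n - 2]
            ys.append(y_n / a0)
        return torch.stack(ys, dim=-1)


class DepthwiseSeparableAtrousBlock(nn.Module):
    """Residual block of depthwise-separable 1-D atrous (dilated)
    convolutions, applied to the frame-level feature map. Kernel size,
    dilation and activation are assumed values, not the paper's."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2  # "same" padding
        # Depthwise: one dilated filter per channel (groups=channels).
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation,
                                   groups=channels)
        # Pointwise: 1x1 convolution mixing information across channels.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, frames)
        return x + self.pointwise(self.act(self.depthwise(x)))


if __name__ == "__main__":
    wave = torch.randn(4, 16000)                       # 1 s of 16 kHz audio
    band = LearnableBiquadBandpass(f0_hz=1000.0, q=2.0, sample_rate=16000)
    filtered = band(wave)                              # (4, 16000)
    frames = torch.randn(4, 32, 100)                   # 32 bands x 100 frames
    block = DepthwiseSeparableAtrousBlock(channels=32)
    out = block(frames)                                # (4, 32, 100)

In a practical filterbank the recursion would be run band-parallel and vectorised (e.g. via torchaudio.functional.lfilter); the sample-by-sample Python loop is kept here only because it makes the difference equation y[n] = (b0 x[n] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2]) / a0 explicit.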