DEGramNet: effective audio analysis based on a fully learnable time-frequency representation

Cited by: 1
Authors
Foggia, Pasquale [1]
Greco, Antonio [1]
Roberto, Antonio [1]
Saggese, Alessia [1]
Vento, Mario [1]
Affiliations
[1] Univ Salerno, Via Giovanni Paolo II 132, Fisciano, SA, Italy
Keywords
Deep learning; Audio representation learning; Signal processing; Sound event classification; Speaker identification; Neural networks; Recognition
DOI
10.1007/s00521-023-08849-7
Chinese Library Classification (CLC) code
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Current state-of-the-art audio analysis algorithms based on deep learning rely on hand-crafted spectrogram-like audio representations, which are more compact than descriptors obtained from the raw waveform; the latter, in turn, generalize poorly when little data is available for training. However, spectrogram-like representations have two main limitations: (1) the parameters of the filters are defined a priori, regardless of the specific audio analysis task; (2) such representations do not perform any denoising of the audio signal, either in the time domain or in the frequency domain. To overcome these limitations, we propose a new general-purpose convolutional architecture for audio analysis tasks, which we call DEGramNet, trained on audio samples described with a novel, compact and learnable time-frequency representation, which we call DEGram. The proposed representation is fully trainable: it learns the frequencies of interest for the specific audio analysis task and performs denoising through a custom time-frequency attention module, which amplifies the frequency and time components in which the sound of interest is actually located. As a consequence, the representation can easily be adapted to the problem at hand, for instance by giving more weight to voice frequencies when the network is used for speaker recognition. DEGramNet achieved state-of-the-art performance on the VGGSound dataset (sound event classification) and accuracy comparable to a complex, special-purpose approach based on neural architecture search on the VoxCeleb dataset (speaker identification). Moreover, we demonstrate that DEGram makes it possible to achieve high accuracy with lightweight neural networks that run in real time on embedded systems, making the solution suitable for cognitive robotics applications.
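To make the two ingredients described above concrete (a filterbank with learnable parameters and a time-frequency attention mask), here is a minimal PyTorch sketch of a fully learnable front-end of this kind. Everything in it is an illustrative assumption rather than the authors' implementation: the class name LearnableTFRepresentation, the STFT settings, the free-form learnable filterbank (the actual DEGram filter parameterization differs), and the small convolutional attention head are all hypothetical.

```python
# Hypothetical sketch (not the authors' code): a learnable time-frequency
# front-end in the spirit of DEGram, combining a filterbank with learnable
# weights and a simple time-frequency attention mask for denoising.
import torch
import torch.nn as nn


class LearnableTFRepresentation(nn.Module):
    def __init__(self, n_fft=512, hop_length=160, n_filters=64):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        n_bins = n_fft // 2 + 1
        # Learnable filterbank over the STFT bins, initialized near-uniform;
        # training can move energy toward task-relevant frequency bands.
        self.filterbank = nn.Parameter(torch.rand(n_filters, n_bins) * 0.1)
        # Time-frequency attention: a small conv stack producing a sigmoid
        # mask that amplifies bins where the sound of interest is located.
        self.attention = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, waveform):  # waveform: (batch, samples)
        spec = torch.stft(
            waveform, n_fft=self.n_fft, hop_length=self.hop_length,
            window=torch.hann_window(self.n_fft, device=waveform.device),
            return_complex=True,
        ).abs()  # magnitude spectrogram: (batch, n_bins, frames)
        # Project onto the learnable filterbank: (batch, n_filters, frames)
        feat = torch.matmul(self.filterbank.clamp(min=0), spec)
        feat = torch.log(feat + 1e-6).unsqueeze(1)  # (batch, 1, n_filters, frames)
        mask = self.attention(feat)  # per-bin attention weights in (0, 1)
        return feat * mask  # denoised, task-adapted representation


# Usage: the output feeds a standard CNN classifier for sound events
# or speakers, trained end to end together with this front-end.
x = torch.randn(2, 16000)  # two 1-second clips at 16 kHz
features = LearnableTFRepresentation()(x)
print(features.shape)  # torch.Size([2, 1, 64, 101])
```

Because the filterbank weights and the attention mask receive gradients from the downstream loss, such a front-end can shift its emphasis toward task-relevant time-frequency regions, which is the behavior the abstract describes (e.g., voice frequencies for speaker recognition).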
Pages: 20207-20219
Page count: 13