DEGramNet: effective audio analysis based on a fully learnable time-frequency representation

Cited by: 1
Authors
Foggia, Pasquale [1]
Greco, Antonio [1]
Roberto, Antonio [1]
Saggese, Alessia [1]
Vento, Mario [1]
Affiliations
[1] Univ Salerno, Via Giovanni Paolo II 132, Fisciano, SA, Italy
Keywords
Deep learning; Audio representation learning; Signal processing; Sound event classification; Speaker identification; Neural networks; Recognition
DOI
10.1007/s00521-023-08849-7
Chinese Library Classification (CLC) code
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Current state-of-the-art audio analysis algorithms based on deep learning rely on hand-crafted spectrogram-like audio representations, which are more compact than descriptors obtained from the raw waveform; the latter, in turn, generalize poorly when little data is available for training. However, spectrogram-like representations have two main limitations: (1) the parameters of the filters are defined a priori, regardless of the specific audio analysis task; (2) such representations do not perform any denoising of the audio signal, either in the time domain or in the frequency domain. To overcome these limitations, we propose a new general-purpose convolutional architecture for audio analysis tasks, which we call DEGramNet, trained on audio samples described with a novel, compact and learnable time-frequency representation, which we call DEGram. The proposed representation is fully trainable: it learns the frequencies of interest for the specific audio analysis task and performs denoising through a custom time-frequency attention module, which amplifies the frequency and time components in which the sound of interest is actually located. As a consequence, the representation can easily be adapted to the problem at hand, for instance by giving more weight to voice frequencies when the network is used for speaker recognition. DEGramNet achieved state-of-the-art performance on the VGGSound dataset (sound event classification) and accuracy comparable to a complex, special-purpose approach based on neural architecture search on the VoxCeleb dataset (speaker identification). Moreover, we demonstrate that DEGram makes it possible to achieve high accuracy with lightweight neural networks that run in real time on embedded systems, making the solution suitable for cognitive robotics applications.
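To make the two ingredients described above concrete (a filterbank with learnable parameters and a time-frequency attention mask), here is a minimal PyTorch sketch of a fully learnable front-end of this kind. Everything in it is an illustrative assumption rather than the authors' implementation: the class name LearnableTFRepresentation, the STFT settings, the free-form learnable filterbank (the actual DEGram filter parameterization differs), and the small convolutional attention head are all hypothetical.

```python
# Hypothetical sketch (not the authors' code): a learnable time-frequency
# front-end in the spirit of DEGram, combining a filterbank with learnable
# weights and a simple time-frequency attention mask for denoising.
import torch
import torch.nn as nn


class LearnableTFRepresentation(nn.Module):
    def __init__(self, n_fft=512, hop_length=160, n_filters=64):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        n_bins = n_fft // 2 + 1
        # Learnable filterbank over the STFT bins, initialized near-uniform;
        # training can move energy toward task-relevant frequency bands.
        self.filterbank = nn.Parameter(torch.rand(n_filters, n_bins) * 0.1)
        # Time-frequency attention: a small conv stack producing a sigmoid
        # mask that amplifies bins where the sound of interest is located.
        self.attention = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, waveform):  # waveform: (batch, samples)
        spec = torch.stft(
            waveform, n_fft=self.n_fft, hop_length=self.hop_length,
            window=torch.hann_window(self.n_fft, device=waveform.device),
            return_complex=True,
        ).abs()  # magnitude spectrogram: (batch, n_bins, frames)
        # Project onto the learnable filterbank: (batch, n_filters, frames)
        feat = torch.matmul(self.filterbank.clamp(min=0), spec)
        feat = torch.log(feat + 1e-6).unsqueeze(1)  # (batch, 1, n_filters, frames)
        mask = self.attention(feat)  # per-bin attention weights in (0, 1)
        return feat * mask  # denoised, task-adapted representation


# Usage: the output feeds a standard CNN classifier for sound events
# or speakers, trained end to end together with this front-end.
x = torch.randn(2, 16000)  # two 1-second clips at 16 kHz
features = LearnableTFRepresentation()(x)
print(features.shape)  # torch.Size([2, 1, 64, 101])
```

Because the filterbank weights and the attention mask receive gradients from the downstream loss, such a front-end can shift its emphasis toward task-relevant time-frequency regions, which is the behavior the abstract describes (e.g., voice frequencies for speaker recognition).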
Pages: 20207-20219
Page count: 13