Multi-channel spectrograms for speech processing applications using deep learning methods

Cited by: 0
Authors
T. Arias-Vergara
P. Klumpp
J. C. Vasquez-Correa
E. Nöth
J. R. Orozco-Arroyave
M. Schuster
Affiliations
[1] Universidad de Antioquia UdeA,Faculty of Engineering
[2] Friedrich-Alexander University,Pattern Recognition Lab
[3] Ludwig-Maximilians University,Department of Otorhinolaryngology, Head and Neck Surgery
Source
Pattern Analysis and Applications | 2021, Vol. 24
Keywords
Speech processing; Multi-channel spectrograms; Cochlear implants; Phoneme recognition
DOI
Not available
Abstract
Time–frequency representations of speech signals provide dynamic information about how the frequency components change over time. To process this information, deep learning models with convolution layers can be used to obtain feature maps. In many speech processing applications, the time–frequency representations are obtained by applying the short-time Fourier transform, and single-channel input tensors are used to feed the models. However, this may limit the potential of convolutional networks to learn different representations of the audio signal. In this paper, we propose a methodology to combine three different time–frequency representations of the signals by computing the continuous wavelet transform, Mel-spectrograms, and Gammatone spectrograms and combining them into 3-channel spectrograms to analyze speech in two different applications: (1) automatic detection of speech deficits in cochlear implant users and (2) phoneme class recognition to extract phone-attribute features. For this, two different deep learning-based models are considered: convolutional neural networks and recurrent neural networks with convolution layers.
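The core idea of the abstract — stacking three time–frequency maps of the same utterance into one image-like tensor for a CNN — can be illustrated with a minimal NumPy sketch. Note that the three channel contents below are placeholders: a single log-magnitude STFT is reused three times, standing in for the CWT scalogram, Mel spectrogram, and Gammatone spectrogram the paper actually combines (each of which requires its own transform, e.g. from PyWavelets or librosa). Only the framing and stacking logic is shown.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    # Slice the signal into overlapping Hann-windowed frames.
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    return np.stack([x[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])

def magnitude_spectrogram(x, frame_len=256, hop=128):
    # Short-time Fourier transform magnitude, shape (freq_bins, n_frames).
    frames = frame_signal(x, frame_len, hop)
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Placeholder audio: 1 s of noise at a 16 kHz sampling rate.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)

spec = magnitude_spectrogram(signal)
chan_cwt = np.log1p(spec)    # placeholder for the CWT scalogram
chan_mel = np.log1p(spec)    # placeholder for the Mel spectrogram
chan_gamma = np.log1p(spec)  # placeholder for the Gammatone spectrogram

# Stack the three (freq, time) maps into one 3-channel tensor,
# the same layout as an RGB image fed to a CNN: (freq, time, 3).
tensor = np.stack([chan_cwt, chan_mel, chan_gamma], axis=-1)
print(tensor.shape)  # (129, 124, 3)
```

In practice the three transforms produce maps of different shapes, so each channel must be interpolated or cropped to a common (freq, time) grid before stacking — a step the sketch sidesteps by reusing one map.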
Pages: 423–431 (8 pages)