Towards an efficient backbone for preserving features in speech emotion recognition: deep-shallow convolution with recurrent neural network

Cited by: 6
Authors
Goel, Dev Priya [1 ]
Mahajan, Kushagra [1 ]
Ngoc Duy Nguyen [2 ]
Srinivasan, Natesan [1 ]
Lim, Chee Peng [2 ]
Affiliations
[1] Indian Inst Technol, Dept Math, Gauhati 781039, India
[2] Deakin Univ, Inst Intelligent Syst Res & Innovat, Waurn Ponds, Vic 3216, Australia
Keywords
Speech emotion recognition; SER; FishNet; Log-mel spectrogram; Deep learning; IEMOCAP; RAVDESS; Human-machine interaction; CNN; RNN; MODEL
DOI
10.1007/s00521-022-07723-2
CLC number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Speech emotion recognition (SER) has attracted a great deal of research interest, as it plays a critical role in human-machine interaction. Unlike typical visual tasks, SER becomes intractable when convolutional neural networks (CNNs) are employed, owing to their limitations in handling log-mel spectrograms. It is therefore useful to establish a feature-extraction backbone that allows CNNs to maintain the information integrity of speech utterances when processing log-mel spectrograms. Moreover, a neural network with a deep stack of layers can suffer performance degradation due to various challenges, including information loss, overfitting, and vanishing gradients. Many studies employ hybrid/multi-modal methods or specialized network designs to mitigate these obstacles; however, such methods are often unstable, hard to configure, and not adaptive to different tasks. In this research, we propose a reusable backbone of CNN blocks for SER tasks, inspired by the FishNet model. Denoted as deep-shallow convolution with RNN (DSCRNN), the proposed backbone preserves features from both deep and shallow layers, which is effective in improving the quality of features extracted from log-mel spectrograms. Simulation results indicate that the proposed DSCRNN backbone improves accuracy by 2% and 11% over a baseline model with traditional CNN blocks in a speaker-independent evaluation on the RAVDESS dataset with 4 classes and 8 classes, respectively.
Pages: 2457-2469
Number of pages: 13
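
The following is a minimal, illustrative sketch of the deep-shallow idea summarized in the abstract: a CNN front end that concatenates a shallow feature map with a deeper one, so early-layer information is preserved, followed by a recurrent layer over the time frames of a log-mel spectrogram. It is not the authors' DSCRNN implementation; the layer counts, channel widths, use of a bidirectional GRU, and fusion by concatenation are assumptions made here purely for illustration.

# Illustrative sketch only; not the published DSCRNN architecture.
import torch
import torch.nn as nn


class DeepShallowBlock(nn.Module):
    """Extracts a shallow and a deep feature map from a log-mel spectrogram
    and concatenates them, so shallow-layer information is preserved."""

    def __init__(self, in_channels: int = 1, channels: int = 32):
        super().__init__()
        # Shallow path: one convolution keeps fine-grained spectral detail.
        self.shallow = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        # Deep path: a stack of convolutions captures higher-level structure.
        self.deep = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenating both paths keeps shallow features alongside deep ones.
        return torch.cat([self.shallow(x), self.deep(x)], dim=1)


class DeepShallowCRNN(nn.Module):
    """Deep-shallow CNN front end followed by a recurrent layer over time."""

    def __init__(self, n_mels: int = 64, n_classes: int = 8, channels: int = 32):
        super().__init__()
        self.backbone = DeepShallowBlock(1, channels)
        self.pool = nn.AdaptiveAvgPool2d((8, None))  # compress the mel axis to 8 bins
        self.rnn = nn.GRU(input_size=2 * channels * 8, hidden_size=128,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 128, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time) log-mel spectrogram
        feats = self.pool(self.backbone(spec))                 # (B, 2C, 8, T)
        b, c, f, t = feats.shape
        seq = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (B, T, 2C*8)
        out, _ = self.rnn(seq)
        return self.classifier(out.mean(dim=1))                # utterance-level logits


if __name__ == "__main__":
    model = DeepShallowCRNN(n_mels=64, n_classes=8)
    dummy = torch.randn(2, 1, 64, 200)   # two 200-frame log-mel spectrograms
    print(model(dummy).shape)            # torch.Size([2, 8])

Running the script prints torch.Size([2, 8]), i.e., one logit per emotion class for each dummy utterance; in practice the input would be a log-mel spectrogram computed from the raw waveform, for example with torchaudio.transforms.MelSpectrogram followed by a logarithm.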