Towards an efficient backbone for preserving features in speech emotion recognition: deep-shallow convolution with recurrent neural network

Cited by: 6
Authors
Goel, Dev Priya [1 ]
Mahajan, Kushagra [1 ]
Ngoc Duy Nguyen [2 ]
Srinivasan, Natesan [1 ]
Lim, Chee Peng [2 ]
Affiliations
[1] Indian Inst Technol, Dept Math, Gauhati 781039, India
[2] Deakin Univ, Inst Intelligent Syst Res & Innovat, Waurn Ponds, Vic 3216, Australia
Keywords
Speech emotion recognition; SER; FishNet; Log-mel spectrogram; Deep learning; IEMOCAP; RAVDESS; Human-machine interaction; CNN; RNN; MODEL
DOI
10.1007/s00521-022-07723-2
CLC number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Speech emotion recognition (SER) has attracted a great deal of research interest, as it plays a critical role in human-machine interaction. Unlike typical visual tasks, SER becomes intractable when convolutional neural networks (CNNs) are employed, owing to their limitations in handling log-mel spectrograms. It is therefore useful to establish a feature-extraction backbone that allows CNNs to maintain the information integrity of speech utterances when processing log-mel spectrograms. Moreover, a neural network with a deep stack of layers can suffer performance degradation due to various challenges, including information loss, overfitting, and vanishing gradients. Many studies employ hybrid/multi-modal methods or specialized network designs to mitigate these obstacles; however, such methods are often unstable, hard to configure, and not adaptive to different tasks. In this research, we propose a reusable backbone of CNN blocks for SER tasks, inspired by the FishNet model. Denoted as deep-shallow convolution with RNN (DSCRNN), the proposed backbone preserves features from both deep and shallow layers, which is effective in improving the quality of features extracted from log-mel spectrograms. Simulation results indicate that the proposed DSCRNN backbone improves accuracy by 2% and 11% over a baseline model with traditional CNN blocks in a speaker-independent evaluation on the RAVDESS dataset with 4 classes and 8 classes, respectively.
Pages: 2457-2469
Number of pages: 13
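
The following is a minimal, illustrative sketch of the deep-shallow idea summarized in the abstract: a CNN front end that concatenates a shallow feature map with a deeper one, so early-layer information is preserved, followed by a recurrent layer over the time frames of a log-mel spectrogram. It is not the authors' DSCRNN implementation; the layer counts, channel widths, use of a bidirectional GRU, and fusion by concatenation are assumptions made here purely for illustration.

# Illustrative sketch only; not the published DSCRNN architecture.
import torch
import torch.nn as nn


class DeepShallowBlock(nn.Module):
    """Extracts a shallow and a deep feature map from a log-mel spectrogram
    and concatenates them, so shallow-layer information is preserved."""

    def __init__(self, in_channels: int = 1, channels: int = 32):
        super().__init__()
        # Shallow path: one convolution keeps fine-grained spectral detail.
        self.shallow = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        # Deep path: a stack of convolutions captures higher-level structure.
        self.deep = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenating both paths keeps shallow features alongside deep ones.
        return torch.cat([self.shallow(x), self.deep(x)], dim=1)


class DeepShallowCRNN(nn.Module):
    """Deep-shallow CNN front end followed by a recurrent layer over time."""

    def __init__(self, n_mels: int = 64, n_classes: int = 8, channels: int = 32):
        super().__init__()
        self.backbone = DeepShallowBlock(1, channels)
        self.pool = nn.AdaptiveAvgPool2d((8, None))  # compress the mel axis to 8 bins
        self.rnn = nn.GRU(input_size=2 * channels * 8, hidden_size=128,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 128, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time) log-mel spectrogram
        feats = self.pool(self.backbone(spec))                 # (B, 2C, 8, T)
        b, c, f, t = feats.shape
        seq = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (B, T, 2C*8)
        out, _ = self.rnn(seq)
        return self.classifier(out.mean(dim=1))                # utterance-level logits


if __name__ == "__main__":
    model = DeepShallowCRNN(n_mels=64, n_classes=8)
    dummy = torch.randn(2, 1, 64, 200)   # two 200-frame log-mel spectrograms
    print(model(dummy).shape)            # torch.Size([2, 8])

Running the script prints torch.Size([2, 8]), i.e., one logit per emotion class for each dummy utterance; in practice the input would be a log-mel spectrogram computed from the raw waveform, for example with torchaudio.transforms.MelSpectrogram followed by a logarithm.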