A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism

Cited by: 107
Authors
Lieskovska, Eva [1 ]
Jakubec, Maros [1 ]
Jarina, Roman [1 ]
Chmulik, Michal [1 ]
Affiliations
[1] Univ Zilina, Fac Elect Engn & Informat Technol, Univ 8215-1, Zilina 01026, Slovakia
Keywords
speech emotion recognition; deep learning; attention mechanism; recurrent neural network; long short-term memory; DATA AUGMENTATION; NEURAL-NETWORKS; FEATURES; AUDIO; CLASSIFIERS; PARAMETERS; DOMINANCE; DATABASES; AROUSAL; MODEL;
DOI
10.3390/electronics10101163
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Emotions are an integral part of human interactions and are significant factors in determining user satisfaction or customer opinion. Speech emotion recognition (SER) modules also play an important role in the development of human-computer interaction (HCI) applications. A tremendous number of SER systems have been developed over the last decades. Attention-based deep neural networks (DNNs) have been shown to be suitable tools for mining information that is unevenly distributed in time in multimedia content. The attention mechanism has recently been incorporated into DNN architectures to also emphasise emotionally salient information. This paper provides a review of recent developments in SER and examines the impact of various attention mechanisms on SER performance. An overall comparison of system accuracies is performed on the widely used IEMOCAP benchmark database.
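As a minimal illustrative sketch (not taken from the reviewed paper), the attention pooling described in the abstract weights frame-level features by learned relevance scores, so that emotionally salient frames contribute more to the utterance-level representation. All names and shapes below are assumptions for the example.

```python
import numpy as np

def attention_pool(frames, w):
    """Collapse frame-level features (T, D) into one utterance-level
    vector (D,) via softmax attention with a learned vector w (D,)."""
    scores = frames @ w                               # (T,) relevance per frame
    scores = scores - scores.max()                    # numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()    # softmax over time
    return alphas @ frames                            # attention-weighted sum

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))   # e.g. 50 acoustic frames, 8-dim features
w = rng.normal(size=8)              # attention parameters (trainable in a DNN)
utt = attention_pool(frames, w)
print(utt.shape)                    # (8,)
```

In the reviewed architectures, `w` (or a small scoring network) is trained jointly with the classifier, replacing uniform mean pooling over time.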
Pages: 29
Related Papers
112 records in total
[61]   The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English [J].
Livingstone, Steven R. ;
Russo, Frank A. .
PLOS ONE, 2018, 13 (05)
[63]   Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech from Existing Podcast Recordings [J].
Lotfian, Reza ;
Busso, Carlos .
IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2019, 10 (04) :471-483
[64]  
Luo DQ, 2018, INTERSPEECH, P152
[65]   Emotion Recognition from Variable-Length Speech Segments Using Deep Learning on Spectrograms [J].
Ma, Xi ;
Wu, Zhiyong ;
Jia, Jia ;
Xu, Mingxing ;
Meng, Helen ;
Cai, Lianhong .
19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, :3683-3687
[66]  
Martin O., 2006, 22 INT C DAT ENG WOR, P8, DOI 10.1109/ICDEW.2006.145
[67]  
Mirsamadi S, 2017, INT CONF ACOUST SPEE, P2227, DOI 10.1109/ICASSP.2017.7952552
[68]   A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition [J].
Mustaqeem ;
Kwon, Soonil .
SENSORS, 2020, 20 (01)
[69]   Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech [J].
Neumann, Michael ;
Ngoc Thang Vu .
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, :1263-1267
[70]   Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets [J].
Noh, Kyoung Ju ;
Jeong, Chi Yoon ;
Lim, Jiyoun ;
Chung, Seungeun ;
Kim, Gague ;
Lim, Jeong Mook ;
Jeong, Hyuntae .
SENSORS, 2021, 21 (05) :1-18