Speech Emotion Recognition via Generation using an Attention-based Variational Recurrent Neural Network

Cited by: 11
Authors
Baruah, Murchana [1 ]
Banerjee, Bonny
Affiliation
[1] Univ Memphis, Inst Intelligent Syst, Memphis, TN 38152 USA
Source
INTERSPEECH 2022 | 2022
Keywords
Speech emotion recognition; recognition by generation; variational RNN; MFCC; attention; active inference; predictive coding; FEATURES;
DOI
10.21437/Interspeech.2022-753
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
The last decade has seen an exponential rise in the number of attention-based models for speech emotion recognition (SER). Most of these models use a spectrogram as the input speech representation and a CNN, RNN, or convolutional RNN as the key machine learning (ML) component, and learn feature weights to implement attention. We propose an attention-based model for SER that uses MFCCs as the input speech representation and a variational RNN (VRNN) as the key ML component. Since MFCCs are of lower dimension than a spectrogram, the model is size- and data-efficient. The VRNN has been used for problems in vision but rarely for SER. Our model is predictive in nature: at each instant, it infers the emotion class, generates the next observation, computes the generation error, and selectively samples (attends to) the locations of high error. Thus, attention emerges in our model and does not require learned feature weights. This simple model provides interesting insights when evaluated for SER on benchmark datasets. The model can operate on audio files of variable length and unbounded duration. This work is the first to explore simultaneous generation and recognition for SER, where the generation capability is necessary for efficient recognition.
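For intuition, the following is a minimal, hypothetical sketch (not the authors' implementation) of the recognition-by-generation loop described in the abstract: at each frame the model generates a prediction of the next MFCC observation, measures the generation error, and attends only to the coefficients with the highest error before updating its recurrent state and its running emotion estimate. The weight matrices, the dimensions (N_MFCC, HIDDEN, N_CLASSES, TOP_K), and the plain recurrent update are illustrative assumptions; the actual model is a learned variational RNN with a latent variable and a KL term.

```python
# Illustrative sketch of attention by generation error over MFCC frames.
# All weights are random toy parameters standing in for a trained VRNN.
import numpy as np

rng = np.random.default_rng(0)

N_MFCC = 13        # MFCC coefficients per frame (assumed)
HIDDEN = 32        # recurrent state size (assumed)
N_CLASSES = 4      # number of emotion classes (assumed)
TOP_K = 4          # high-error coefficients attended per step (assumed)

W_h = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))    # state -> state
W_x = rng.normal(scale=0.1, size=(HIDDEN, N_MFCC))    # observation -> state
W_gen = rng.normal(scale=0.1, size=(N_MFCC, HIDDEN))  # state -> generated frame
W_cls = rng.normal(scale=0.1, size=(N_CLASSES, HIDDEN))  # state -> emotion logits


def recognize_by_generation(mfcc_frames):
    """mfcc_frames: array of shape (T, N_MFCC). Returns averaged class logits."""
    h = np.zeros(HIDDEN)
    logits = np.zeros(N_CLASSES)
    attended = np.zeros(N_MFCC)
    for x in mfcc_frames:
        # Generate (predict) the next observation from the current state.
        x_hat = W_gen @ h
        # Generation error per coefficient; attention = locations of high error.
        err = np.abs(x - x_hat)
        attend_idx = np.argsort(err)[-TOP_K:]
        attended[:] = 0.0
        attended[attend_idx] = x[attend_idx]   # sample only the attended locations
        # Update the recurrent state with the attended (masked) observation.
        h = np.tanh(W_h @ h + W_x @ attended)
        # Accumulate emotion evidence at every step, so any length of audio works.
        logits += W_cls @ h
    return logits / max(len(mfcc_frames), 1)


# Example: 120 random frames standing in for the MFCCs of one utterance.
print(recognize_by_generation(rng.normal(size=(120, N_MFCC))))
```

In this framing, attention is a by-product of prediction: no feature weights are learned for it, and because evidence is accumulated frame by frame, the loop runs on utterances of any length.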
Pages: 4710-4714
Page count: 5