SPEECH EMOTION RECOGNITION USING QUATERNION CONVOLUTIONAL NEURAL NETWORKS

Cited by: 47
Authors
Muppidi, Aneesh [1]
Radfar, Martin [1]
Affiliations
[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
Speech Emotion Recognition; Signal Processing; Quaternion Deep Learning; Convolutional Neural Networks;
DOI
10.1109/ICASSP39728.2021.9414248
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject Classification Codes
070206; 082403
Abstract
Although speech recognition has become a widespread technology, inferring emotion from speech signals remains a challenge. Our paper addresses this problem by proposing a quaternion convolutional neural network (QCNN) based speech emotion recognition (SER) model in which Mel-spectrogram features of speech signals are encoded in an RGB quaternion domain. We demonstrate that our QCNN-based SER model outperforms other real-valued methods on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS, 8 classes), achieving, to the best of our knowledge, state-of-the-art results. The QCNN model also achieves results comparable to state-of-the-art methods on the Interactive Emotional Dyadic Motion Capture (IEMOCAP, 4 classes) and Berlin EMO-DB (7 classes) datasets. Specifically, the model achieves accuracies of 77.87%, 70.46%, and 88.78% on the RAVDESS, IEMOCAP, and EMO-DB datasets, respectively. Additionally, model-size results reveal that the quaternion unit structure encodes internal dependencies significantly better than real-valued structures.
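
As a rough illustration of the Mel-spectrogram-to-RGB-quaternion encoding described in the abstract, the Python sketch below (assuming librosa and matplotlib are available) maps a log-Mel spectrogram through a standard colormap to obtain R, G, and B channels and places them in the three imaginary quaternion components, leaving the real part at zero. The function name, colormap choice, and parameter values are illustrative assumptions, not the paper's exact preprocessing pipeline.

import numpy as np
import librosa
import matplotlib.cm as cm

def mel_to_quaternion(wav_path, sr=16000, n_mels=128):
    # Load audio and compute a log-scaled Mel-spectrogram.
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Normalize to [0, 1] and map to RGB via a standard colormap
    # (assumption: the paper's exact colormap is not reproduced here).
    norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    rgb = cm.get_cmap("viridis")(norm)[..., :3]        # shape (n_mels, frames, 3)

    # Quaternion layout per time-frequency bin: [real, i, j, k] = [0, R, G, B].
    real = np.zeros_like(rgb[..., :1])
    return np.concatenate([real, rgb], axis=-1).astype(np.float32)

The resulting (n_mels, frames, 4) tensor can then be fed to quaternion convolution layers, which treat the four channels at each position as a single quaternion rather than as independent real-valued channels.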
Pages: 6309-6313
Number of pages: 5