SPEECH EMOTION RECOGNITION USING QUATERNION CONVOLUTIONAL NEURAL NETWORKS

Cited by: 47
Authors
Muppidi, Aneesh [1]
Radfar, Martin [1]
Affiliations
[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
Speech Emotion Recognition; Signal Processing; Quaternion Deep Learning; Convolutional Neural Networks;
DOI
10.1109/ICASSP39728.2021.9414248
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject Classification Codes
070206; 082403
Abstract
Although speech recognition has become a widespread technology, inferring emotion from speech signals remains a challenge. Our paper addresses this problem by proposing a quaternion convolutional neural network (QCNN) based speech emotion recognition (SER) model in which Mel-spectrogram features of speech signals are encoded in an RGB quaternion domain. We demonstrate that our QCNN-based SER model outperforms other real-valued methods on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS, 8 classes), achieving, to the best of our knowledge, state-of-the-art results. The QCNN model also achieves results comparable to state-of-the-art methods on the Interactive Emotional Dyadic Motion Capture (IEMOCAP, 4 classes) and Berlin EMO-DB (7 classes) datasets. Specifically, the model achieves accuracies of 77.87%, 70.46%, and 88.78% on the RAVDESS, IEMOCAP, and EMO-DB datasets, respectively. Additionally, model-size results reveal that the quaternion unit structure encodes internal dependencies significantly better than real-valued structures.
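
As a rough illustration of the Mel-spectrogram-to-RGB-quaternion encoding described in the abstract, the Python sketch below (assuming librosa and matplotlib are available) maps a log-Mel spectrogram through a standard colormap to obtain R, G, and B channels and places them in the three imaginary quaternion components, leaving the real part at zero. The function name, colormap choice, and parameter values are illustrative assumptions, not the paper's exact preprocessing pipeline.

import numpy as np
import librosa
import matplotlib.cm as cm

def mel_to_quaternion(wav_path, sr=16000, n_mels=128):
    # Load audio and compute a log-scaled Mel-spectrogram.
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Normalize to [0, 1] and map to RGB via a standard colormap
    # (assumption: the paper's exact colormap is not reproduced here).
    norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    rgb = cm.get_cmap("viridis")(norm)[..., :3]        # shape (n_mels, frames, 3)

    # Quaternion layout per time-frequency bin: [real, i, j, k] = [0, R, G, B].
    real = np.zeros_like(rgb[..., :1])
    return np.concatenate([real, rgb], axis=-1).astype(np.float32)

The resulting (n_mels, frames, 4) tensor can then be fed to quaternion convolution layers, which treat the four channels at each position as a single quaternion rather than as independent real-valued channels.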
Pages: 6309-6313
Number of pages: 5