Emotion Recognition with Speech and Facial Images

Times Cited: 0
Authors
Xue, Peiyun [1 ,2 ]
Dai, Shutao [1 ]
Bai, Jing [1 ]
Gao, Xiang [1 ]
Affiliations
[1] Taiyuan Univ Technol, Coll Elect Informat & Opt Engn, Taiyuan 030024, Peoples R China
[2] Shanxi Adv Innovat Res Inst, Postdoctoral Workstn, Taiyuan 030032, Peoples R China
Keywords
Emotion recognition; Attention mechanism; Multi-branch convolution; Residual mixing; Decision fusion; NETWORK;
DOI
10.11999/JEIT240087
CLC Number
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Subject Classification Codes
0808 ; 0809 ;
Abstract
To improve the accuracy of emotion recognition models and address insufficient emotional feature extraction, this paper studies bimodal emotion recognition from audio and facial images. For the audio modality, a Multi-branch Convolutional Neural Network (MCNN) feature extraction model incorporating a channel-spatial attention mechanism is proposed, which extracts emotional features from speech spectrograms along the time, space, and local-feature dimensions. For the facial image modality, a Residual Hybrid Convolutional Neural Network (RHCNN) feature extraction model is introduced, which further establishes a parallel attention mechanism that concentrates on global emotional features to enhance recognition accuracy. The emotional features extracted from audio and facial images are classified through separate classification layers, and decision fusion is then used to combine the classification results. Experimental results show that the proposed bimodal fusion model achieves recognition accuracies of 97.22%, 94.78%, and 96.96% on the RAVDESS, eNTERFACE'05, and RML datasets, respectively, improving on single-modality audio recognition by 11.02%, 4.24%, and 8.83%, and on single-modality facial image recognition by 4.60%, 6.74%, and 4.10%. The proposed model also outperforms related methods applied to these datasets in recent years, demonstrating that the bimodal fusion model can effectively focus on emotional information and thereby improve overall emotion recognition accuracy.
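The abstract describes combining the two modality classifiers by decision fusion. The paper does not specify the exact fusion rule here, so the following is only a minimal sketch of one common choice, weighted averaging of the per-modality class probabilities, with the weight `w_audio` an assumed hyperparameter:

```python
import numpy as np

def decision_fusion(p_audio, p_face, w_audio=0.5):
    """Fuse two modality classifiers' softmax outputs by weighted averaging.

    p_audio, p_face: class-probability vectors of shape (n_classes,) from
    the audio and facial-image branches. The weighting scheme and w_audio
    are assumptions for illustration, not the paper's stated rule.
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_face = np.asarray(p_face, dtype=float)
    fused = w_audio * p_audio + (1.0 - w_audio) * p_face
    return fused / fused.sum()  # renormalize to a probability vector

# Toy 4-class example: audio favors class 1, the face branch favors class 2;
# with more weight on the face branch, the fused decision is class 2.
p_a = [0.1, 0.5, 0.3, 0.1]
p_f = [0.1, 0.2, 0.6, 0.1]
fused = decision_fusion(p_a, p_f, w_audio=0.4)
predicted_class = int(np.argmax(fused))
```

In practice the weight could be tuned on a validation set, or replaced by other rules (product, max, or a learned combination) without changing the two feature extraction branches.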
Pages: 4542-4552
Number of Pages: 11
References
39 references in total
[1]   Two-Way Feature Extraction for Speech Emotion Recognition Using Deep Learning [J].
Aggarwal, Apeksha ;
Srivastava, Akshat ;
Agarwal, Ajay ;
Chahal, Nidhi ;
Singh, Dilbag ;
Alnuaim, Abeer Ali ;
Alhadlaq, Aseel ;
Lee, Heung-No .
SENSORS, 2022, 22 (06)
[2]  
[Anonymous], 2006, Proc. 22nd Int. Conf. on Data Engineering Workshops (ICDEW), Atlanta, USA, P8, DOI 10.1109/ICDEW.2006.145
[3]  
[Anonymous], 2018, Proc. European Conference on Computer Vision (ECCV), Munich, Germany, P116, DOI 10.1007/978-3-030-01264-9_8
[4]  
[Anonymous], 2023, Procedia Computer Science, V225, P2556, DOI 10.1016/j.procs.2023.10.247
[5]  
BOUALI Y L, 2022, Proc. 6th Int. Conf. on Advanced Technologies, Signal and Image Processing (ATSIP), P1, DOI 10.1109/ATSIP55956.2022.9805959
[6]   Multi-Modal Emotion Recognition by Fusing Correlation Features of Speech-Visual [J].
Chen Guanghui ;
Zeng Xiaoping .
IEEE SIGNAL PROCESSING LETTERS, 2021, 28 :533-537
[7]   K-Means Clustering-Based Kernel Canonical Correlation Analysis for Multimodal Emotion Recognition in Human-Robot Interaction [J].
Chen, Luefeng ;
Wang, Kuanlin ;
Li, Min ;
Wu, Min ;
Pedrycz, Witold ;
Hirota, Kaoru .
IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2023, 70 (01) :1016-1024
[8]   A novel dual attention-based BLSTM with hybrid features in speech emotion recognition [J].
Chen, Qiupu ;
Huang, Guimin .
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2021, 102
[9]   Xception: Deep Learning with Depthwise Separable Convolutions [J].
Chollet, Francois .
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :1800-1807
[10]  
Cornejo Jadisha, 2019, 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), P111, DOI 10.1109/ICMLA.2019.00026