A CROSS-ATTENTION EMOTION RECOGNITION ALGORITHM BASED ON AUDIO AND VIDEO MODALITIES

Cited by: 0
Authors
Wu, Xiao [1 ]
Mu, Xuan [1 ]
Qi, Wen [3 ]
Liu, Xiaorui [1 ,2 ]
Affiliations
[1] Qingdao Univ, Automat Sch, 308 Ningxia Rd, Qingdao 266000, Peoples R China
[2] Shandong Key Lab Ind Control, Qingdao 266071, Peoples R China
[3] South China Univ Technol, Sch Future Technol, Guangzhou 511436, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024 | 2024
Keywords
multimodal; emotion recognition; parallel convolution; cross attention;
DOI
10.1109/ICASSPW62465.2024.10626511
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
In recent years, emotion recognition has received significant attention. In this paper, multimodal information, including speech and facial expressions, is adopted to realize human emotion classification. First, we propose a speech emotion recognition model based on a parallel convolution module (Pconv) and a facial expression recognition model based on an improved Inception-ResnetV2 network. The recognized features of speech and expression are then fused by a cross-attention module combined with a Bidirectional Long Short-Term Memory (BiLSTM) network. Experimental results on the CH-SIMS and CMU-MOSI datasets demonstrate that the proposed algorithm achieves high recognition accuracy, and each component of the model contributes to the performance improvement under a fair comparison.
Pages: 309-313
Number of pages: 5
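
The fusion stage described in the abstract (cross-attention between audio and video features, followed by a BiLSTM) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the feature dimensions, head count, and class count are hypothetical, the Pconv and Inception-ResnetV2 front-ends are replaced by simple linear projections, and the audio and video sequences are assumed to be time-aligned and of equal length.

# Minimal sketch of cross-attention audio-video fusion with a BiLSTM (assumptions noted above).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_audio=128, d_video=256, d_model=128, n_heads=4, n_classes=2):
        super().__init__()
        # Hypothetical projections standing in for the Pconv and Inception-ResnetV2 feature extractors.
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.video_proj = nn.Linear(d_video, d_model)
        # Cross-attention in both directions: audio queries attend to video, and vice versa.
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # BiLSTM over the concatenated attended features, then a linear emotion classifier.
        self.bilstm = nn.LSTM(2 * d_model, d_model, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, T, d_audio), video_feats: (B, T, d_video); equal T is assumed.
        a = self.audio_proj(audio_feats)
        v = self.video_proj(video_feats)
        a_att, _ = self.a2v(query=a, key=v, value=v)   # audio attends to video
        v_att, _ = self.v2a(query=v, key=a, value=a)   # video attends to audio
        fused = torch.cat([a_att, v_att], dim=-1)      # (B, T, 2*d_model)
        out, _ = self.bilstm(fused)
        return self.classifier(out.mean(dim=1))        # sequence-level emotion logits

# Usage example with random tensors (batch of 2, 50 aligned time steps).
if __name__ == "__main__":
    model = CrossAttentionFusion()
    audio = torch.randn(2, 50, 128)
    video = torch.randn(2, 50, 256)
    print(model(audio, video).shape)  # torch.Size([2, 2])

Mean-pooling the BiLSTM outputs before classification is one simple way to obtain an utterance-level prediction; the paper itself may use a different aggregation.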