A CROSS-ATTENTION EMOTION RECOGNITION ALGORITHM BASED ON AUDIO AND VIDEO MODALITIES

Cited by: 0
Authors
Wu, Xiao [1 ]
Mu, Xuan [1 ]
Qi, Wen [3 ]
Liu, Xiaorui [1 ,2 ]
Affiliations
[1] Qingdao Univ, Automat Sch, 308 Ningxia Rd, Qingdao 266000, Peoples R China
[2] Shandong Key Lab Ind Control, Qingdao 266071, Peoples R China
[3] South China Univ Technol, Sch Future Technol, Guangzhou 511436, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024 | 2024
Keywords
multimodal; emotion recognition; parallel convolution; cross attention;
DOI
10.1109/ICASSPW62465.2024.10626511
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
In recent years, emotion recognition has received significant attention. In this paper, multimodal information, including speech and facial expressions, is adopted to realize human emotion classification. First, we propose a speech emotion recognition model based on a parallel convolution module (Pconv) and a facial expression recognition model based on an improved Inception-ResnetV2 network. The recognized features of speech and expression are then fused by a cross-attention module combined with a Bidirectional Long Short-Term Memory (BiLSTM) network. Experimental results on the CH-SIMS and CMU-MOSI datasets demonstrate that the proposed algorithm achieves high recognition accuracy, and each component of the model contributes to the performance improvement under a fair comparison.
Pages: 309-313
Number of pages: 5
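
The fusion stage described in the abstract (cross-attention between audio and video features, followed by a BiLSTM) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the feature dimensions, head count, and class count are hypothetical, the Pconv and Inception-ResnetV2 front-ends are replaced by simple linear projections, and the audio and video sequences are assumed to be time-aligned and of equal length.

# Minimal sketch of cross-attention audio-video fusion with a BiLSTM (assumptions noted above).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_audio=128, d_video=256, d_model=128, n_heads=4, n_classes=2):
        super().__init__()
        # Hypothetical projections standing in for the Pconv and Inception-ResnetV2 feature extractors.
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.video_proj = nn.Linear(d_video, d_model)
        # Cross-attention in both directions: audio queries attend to video, and vice versa.
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # BiLSTM over the concatenated attended features, then a linear emotion classifier.
        self.bilstm = nn.LSTM(2 * d_model, d_model, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, T, d_audio), video_feats: (B, T, d_video); equal T is assumed.
        a = self.audio_proj(audio_feats)
        v = self.video_proj(video_feats)
        a_att, _ = self.a2v(query=a, key=v, value=v)   # audio attends to video
        v_att, _ = self.v2a(query=v, key=a, value=a)   # video attends to audio
        fused = torch.cat([a_att, v_att], dim=-1)      # (B, T, 2*d_model)
        out, _ = self.bilstm(fused)
        return self.classifier(out.mean(dim=1))        # sequence-level emotion logits

# Usage example with random tensors (batch of 2, 50 aligned time steps).
if __name__ == "__main__":
    model = CrossAttentionFusion()
    audio = torch.randn(2, 50, 128)
    video = torch.randn(2, 50, 256)
    print(model(audio, video).shape)  # torch.Size([2, 2])

Mean-pooling the BiLSTM outputs before classification is one simple way to obtain an utterance-level prediction; the paper itself may use a different aggregation.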