Audio-Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition

Cited by: 3
Authors
Guo, Peini [1 ,2 ]
Chen, Zhengyan [1 ]
Li, Yidi [1 ]
Liu, Hong [1 ]
Affiliations
[1] Peking Univ, Shenzhen Grad Sch, Key Lab Machine Percept, Shenzhen, Peoples R China
[2] Shanghai Univ, Shanghai, Peoples R China
Source
ARTIFICIAL INTELLIGENCE, CICAI 2022, PT II | 2022, Vol. 13605
Keywords
Multimodal emotion recognition; Audio-visual fusion; Convolutional Neural Network; Transformer;
D O I
10.1007/978-3-031-20500-2_26
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Audio-visual emotion recognition aims to integrate audio and visual information for accurate emotion prediction, and is widely used in real-world applications. However, most existing methods fail to fully exploit the complementary information across modalities to obtain rich, emotion-related feature representations. Recently, Transformer- and CNN-based models have achieved remarkable results in automatic speech recognition. Motivated by this, we propose a novel audio-visual fusion network based on 3D-CNN and the Convolution-augmented Transformer (Conformer) for multimodal emotion recognition. First, a 3D-CNN is employed to process face sequences extracted from the video, and a 1D-CNN is used to process MFCC features of the audio signal. Second, the visual and audio features are fed into a feature fusion module, which combines a set of convolutional layers for extracting local features with a self-attention mechanism for capturing global interactions of multimodal information. Finally, the fused features are passed through linear layers to obtain the prediction results. To verify the effectiveness of the proposed method, experiments are performed on RAVDESS and a newly collected dataset named PKU-ER. The experimental results show that the proposed model achieves state-of-the-art performance in audio-only, video-only, and audio-visual fusion experiments.
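The fusion module described in the abstract (convolution for local features plus self-attention for global interactions over concatenated modality features) can be sketched as follows. This is a minimal, hypothetical NumPy rendition under stated assumptions: the feature dimensions, the single-head identity-projection attention, the moving-average "depthwise convolution", and all function names are illustrative choices, not the authors' actual Conformer implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over the sequence.
    # x: (T, d). Identity Q/K/V projections are an illustrative simplification.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (T, T): global pairwise interactions
    return softmax(scores, axis=-1) @ x    # (T, d)

def depthwise_conv1d(x, k=3):
    # Per-channel moving-average convolution capturing local context.
    # x: (T, d); zero padding preserves the sequence length.
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([xp[t:t + k].mean(axis=0) for t in range(x.shape[0])])

def fuse_and_classify(audio_feats, visual_feats, w_out):
    # Concatenate modality features along the time axis, apply the
    # conv + attention fusion (with residuals), mean-pool, and project
    # to emotion logits via a linear layer.
    x = np.concatenate([audio_feats, visual_feats], axis=0)  # (Ta+Tv, d)
    x = x + depthwise_conv1d(x)        # local features (residual)
    x = x + self_attention(x)          # global interactions (residual)
    pooled = x.mean(axis=0)            # (d,)
    return pooled @ w_out              # (num_emotions,) logits

rng = np.random.default_rng(0)
audio = rng.normal(size=(20, 64))      # stand-in for 1D-CNN features of MFCCs
visual = rng.normal(size=(16, 64))     # stand-in for 3D-CNN face-sequence features
w = rng.normal(size=(64, 8))           # 8 emotion classes, as in RAVDESS
logits = fuse_and_classify(audio, visual, w)
print(logits.shape)                    # (8,)
```

In the paper's actual model the convolution and attention are organized as Conformer blocks with learned projections and feed-forward layers; this sketch only conveys the local-plus-global fusion idea.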
Pages: 315-326
Page count: 12
Related Papers
50 records in total
  • [1] Audio-Visual Learning for Multimodal Emotion Recognition
    Fan, Siyu
    Jing, Jianan
    Wang, Chongwen
    SYMMETRY-BASEL, 2025, 17 (03):
  • [2] Multimodal Attentive Fusion Network for audio-visual event recognition
    Brousmiche, Mathilde
    Rouat, Jean
    Dupont, Stephane
    INFORMATION FUSION, 2022, 85 : 52 - 59
  • [3] Fish behavior recognition based on an audio-visual multimodal interactive fusion network
    Yang, Yuxin
    Yu, Hong
    Zhang, Xin
    Zhang, Peng
    Tu, Wan
    Gu, Lishuai
    AQUACULTURAL ENGINEERING, 2024, 107
  • [4] Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition
    Zhang, Shiqing
    Zhang, Shiliang
    Huang, Tiejun
    Gao, Wen
    ICMR'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2016, : 281 - 284
  • [5] Metric Learning-Based Multimodal Audio-Visual Emotion Recognition
    Ghaleb, Esam
    Popa, Mirela
    Asteriadis, Stylianos
    IEEE MULTIMEDIA, 2020, 27 (01) : 37 - 48
  • [6] Fusion of Classifier Predictions for Audio-Visual Emotion Recognition
    Noroozi, Fatemeh
    Marjanovic, Marina
    Njegus, Angelina
    Escalera, Sergio
    Anbarjafari, Gholamreza
    2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2016, : 61 - 66
  • [7] Multimodal and Temporal Perception of Audio-visual Cues for Emotion Recognition
    Ghaleb, Esam
    Popa, Mirela
    Asteriadis, Stylianos
    2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2019,
  • [8] Multimodal Emotion Recognition using Physiological and Audio-Visual Features
    Matsuda, Yuki
    Fedotov, Dmitrii
    Takahashi, Yuta
    Arakawa, Yutaka
    Yasumo, Keiichi
    Minker, Wolfgang
    PROCEEDINGS OF THE 2018 ACM INTERNATIONAL JOINT CONFERENCE ON PERVASIVE AND UBIQUITOUS COMPUTING AND PROCEEDINGS OF THE 2018 ACM INTERNATIONAL SYMPOSIUM ON WEARABLE COMPUTERS (UBICOMP/ISWC'18 ADJUNCT), 2018, : 946 - 951
  • [9] Deep learning based multimodal emotion recognition using model-level fusion of audio-visual modalities
    Middya, Asif Iqbal
    Nag, Baibhav
    Roy, Sarbani
    KNOWLEDGE-BASED SYSTEMS, 2022, 244