AVaTER: Fusing Audio, Visual, and Textual Modalities Using Cross-Modal Attention for Emotion Recognition

Cited by: 1
Authors
Das, Avishek [1]
Sarma, Moumita Sen [1]
Hoque, Mohammed Moshiul [1]
Siddique, Nazmul [2]
Dewan, M. Ali Akber [3]
Affiliations
[1] Chittagong Univ Engn & Technol, Dept Comp Sci & Engn, Chittagong 4349, Bangladesh
[2] Ulster Univ, Sch Comp Engn & Intelligent Syst, Belfast BT15 1AP, Northern Ireland
[3] Athabasca Univ, Fac Sci & Technol, Sch Comp & Informat Syst, Athabasca, AB T9S 3A3, Canada
Keywords
multimodal emotion recognition; natural language processing; multimodal dataset; cross-modal attention; transformers
DOI
10.3390/s24185862
Chinese Library Classification (CLC) number
O65 [Analytical Chemistry]
Discipline classification code
070302; 081704
Abstract
Multimodal emotion classification (MEC) involves analyzing and identifying human emotions by integrating data from multiple sources, such as audio, video, and text. This approach leverages the complementary strengths of each modality to enhance the accuracy and robustness of emotion recognition systems. However, one significant challenge is effectively integrating these diverse data sources, each with unique characteristics and levels of noise. Additionally, the scarcity of large, annotated multimodal datasets in Bangla limits the training and evaluation of models. In this work, we unveiled a pioneering multimodal Bangla dataset, MAViT-Bangla (Multimodal Audio Video Text Bangla dataset). This dataset, comprising 1002 samples across audio, video, and text modalities, is a unique resource for emotion recognition studies in the Bangla language. It features emotional categories such as anger, fear, joy, and sadness, providing a comprehensive platform for research. Additionally, we developed a framework for audio, video and textual emotion recognition (i.e., AVaTER) that employs a cross-modal attention mechanism among unimodal features. This mechanism fosters the interaction and fusion of features from different modalities, enhancing the model's ability to capture nuanced emotional cues. The effectiveness of this approach was demonstrated by achieving an F1-score of 0.64, a significant improvement over unimodal methods.
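The cross-modal attention idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the AVaTER architecture: the record gives no layer details, so the code below assumes plain scaled dot-product attention with no learned projections, mean pooling over time, and simple concatenation of the pooled outputs. The function names (`cross_modal_attention`, `mean_pool`, `fuse`) and the toy feature values are hypothetical and chosen only for illustration.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(queries, keys, values):
    # Queries come from one modality; keys and values come from another.
    # Each output row is a weighted average of the other modality's values.
    d = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return attended

def mean_pool(seq):
    # Average a sequence of feature vectors into a single vector.
    n = len(seq)
    return [sum(step[j] for step in seq) / n for j in range(len(seq[0]))]

def fuse(text, audio, video):
    # Assumption: every modality attends to every other modality, and the
    # pooled results are concatenated into one fused feature vector.
    modalities = {"text": text, "audio": audio, "video": video}
    fused = []
    for target_name, target in modalities.items():
        for source_name, source in modalities.items():
            if source_name != target_name:
                fused.extend(mean_pool(cross_modal_attention(target, source, source)))
    return fused

# Toy unimodal features: sequences of 2-dimensional vectors of different lengths.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0]]
video = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.4], [0.0, 1.0]]
fused = fuse(text, audio, video)
```

In this sketch the fused vector has one pooled block per ordered modality pair (six pairs of two dimensions each); a real system would typically pass such a vector to a classifier over the four emotion categories named in the abstract.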
Pages: 23
Related papers
34 in total
  • [1] Emotion recognition using cross-modal attention from EEG and facial expression
    Cui, Rongxuan
    Chen, Wanzhong
    Li, Mingyang
    KNOWLEDGE-BASED SYSTEMS, 2024, 304
  • [2] Multimodal Emotion Recognition using Cross-Modal Attention and 1D Convolutional Neural Networks
    Krishna, D. N.
    Patil, Ankita
    INTERSPEECH 2020, 2020, : 4243 - 4247
  • [3] Mi-CGA: Cross-modal Graph Attention Network for robust emotion recognition in the presence of incomplete modalities
    Nguyen, Cam-Van Thi
    Kieu, Hai-Dang
    Ha, Quang-Thuy
    Phan, Xuan-Hieu
    Le, Duc-Trong
    NEUROCOMPUTING, 2025, 623
  • [4] Cross-modal orienting of visual attention
    Hillyard, Steven A.
    Stoermer, Viola S.
    Feng, Wenfeng
    Martinez, Antigona
    McDonald, John J.
    NEUROPSYCHOLOGIA, 2016, 83 : 170 - 178
  • [5] Audio-Visual Attention Networks for Emotion Recognition
    Lee, Jiyoung
    Kim, Sunok
    Kim, Seungryong
    Sohn, Kwanghoon
    AVSU'18: PROCEEDINGS OF THE 2018 WORKSHOP ON AUDIO-VISUAL SCENE UNDERSTANDING FOR IMMERSIVE MULTIMEDIA, 2018, : 27 - 32
  • [6] Speaker-aware cognitive network with cross-modal attention for multimodal emotion recognition in conversation
    Guo, Lili
    Song, Yikang
    Ding, Shifei
    KNOWLEDGE-BASED SYSTEMS, 2024, 296
  • [7] Multi-corpus emotion recognition method based on cross-modal gated attention fusion
    Ryumina, Elena
    Ryumin, Dmitry
    Axyonov, Alexandr
    Ivanko, Denis
    Karpov, Alexey
    PATTERN RECOGNITION LETTERS, 2025, 190 : 192 - 200
  • [9] Cross-Modal Dynamic Transfer Learning for Multimodal Emotion Recognition
    Hong, Soyeon
    Kang, Hyeoungguk
    Cho, Hyunsouk
    IEEE ACCESS, 2024, 12 : 14324 - 14333
  • [10] Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition
    Yang, Dingkang
    Huang, Shuai
    Liu, Yang
    Zhang, Lihua
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2093 - 2097