AVaTER: Fusing Audio, Visual, and Textual Modalities Using Cross-Modal Attention for Emotion Recognition

Times Cited: 1
Authors
Das, Avishek [1 ]
Sarma, Moumita Sen [1 ]
Hoque, Mohammed Moshiul [1 ]
Siddique, Nazmul [2 ]
Dewan, M. Ali Akber [3 ]
Affiliations
[1] Chittagong Univ Engn & Technol, Dept Comp Sci & Engn, Chittagong 4349, Bangladesh
[2] Ulster Univ, Sch Comp Engn & Intelligent Syst, Belfast BT15 1AP, Northern Ireland
[3] Athabasca Univ, Fac Sci & Technol, Sch Comp & Informat Syst, Athabasca, AB T9S 3A3, Canada
Keywords
multimodal emotion recognition; natural language processing; multimodal dataset; cross-modal attention; transformers;
DOI
10.3390/s24185862
Chinese Library Classification (CLC)
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Multimodal emotion classification (MEC) involves analyzing and identifying human emotions by integrating data from multiple sources, such as audio, video, and text. This approach leverages the complementary strengths of each modality to enhance the accuracy and robustness of emotion recognition systems. However, one significant challenge is effectively integrating these diverse data sources, each with unique characteristics and levels of noise. Additionally, the scarcity of large, annotated multimodal datasets in Bangla limits the training and evaluation of models. In this work, we unveiled a pioneering multimodal Bangla dataset, MAViT-Bangla (Multimodal Audio Video Text Bangla dataset). This dataset, comprising 1002 samples across audio, video, and text modalities, is a unique resource for emotion recognition studies in the Bangla language. It features emotional categories such as anger, fear, joy, and sadness, providing a comprehensive platform for research. Additionally, we developed a framework for audio, video, and textual emotion recognition (i.e., AVaTER) that employs a cross-modal attention mechanism among unimodal features. This mechanism fosters the interaction and fusion of features from different modalities, enhancing the model's ability to capture nuanced emotional cues. The effectiveness of this approach was demonstrated by achieving an F1-score of 0.64, a significant improvement over unimodal methods.
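The abstract does not give implementation details, but a minimal sketch can illustrate what cross-modal attention fusion among unimodal features typically looks like. The snippet below is an assumption-laden illustration, not the authors' AVaTER code: the feature dimension (256), head count, directed attention pairs, mean pooling, and the four-way classifier (mirroring the anger, fear, joy, and sadness categories) are all illustrative choices.

```python
# Minimal sketch of cross-modal attention fusion over pre-extracted unimodal
# features (text, audio, video). Illustrative only; not the authors' code.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 4):
        super().__init__()
        # One attention block per directed modality pair (query <- key/value).
        self.text_from_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_from_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_from_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_from_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(4 * dim, num_classes)

    def forward(self, text, audio, video):
        # Each input: (batch, seq_len, dim) unimodal feature sequence.
        t_a, _ = self.text_from_audio(text, audio, audio)  # text attends to audio
        t_v, _ = self.text_from_video(text, video, video)  # text attends to video
        a_t, _ = self.audio_from_text(audio, text, text)   # audio attends to text
        v_t, _ = self.video_from_text(video, text, text)   # video attends to text
        # Pool each cross-attended sequence and concatenate for classification.
        fused = torch.cat([x.mean(dim=1) for x in (t_a, t_v, a_t, v_t)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = CrossModalAttentionFusion()
    text = torch.randn(2, 20, 256)   # e.g., transformer text features
    audio = torch.randn(2, 50, 256)  # e.g., frame-level audio features
    video = torch.randn(2, 30, 256)  # e.g., frame-level video features
    print(model(text, audio, video).shape)  # -> torch.Size([2, 4])
```

Concatenating the pooled cross-attended representations before a single linear classifier is only one plausible fusion head; the paper's actual architecture may differ in how attended features are combined and classified.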
Pages: 23