Audio-Guided Fusion Techniques for Multimodal Emotion Analysis

Cited by: 0
|
Authors
Shi, Pujin [1]
Gao, Fei [1]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Cyberspace Secur, State Key Lab Networking & Switching Technol, Beijing, Peoples R China
Keywords
Multimodal emotion recognition; Multimodal feature fusion; Self-supervised learning; RECOGNITION; SLEEP;
DOI
10.1145/3689092.3689414
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) of MER2024. First, to enhance the performance of the feature extractors on sentiment classification, we fine-tuned the video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, on labeled data; this effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism that leverages the robustness of Hubert-large and shows superior effectiveness in fusing both inter-channel and intra-channel information. Third, to improve model accuracy, we iteratively apply self-supervised training, using high-confidence predictions on unlabeled data as pseudo-labels. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets, so we adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, which ultimately earned us third place in the MER-SEMI track.
Pages: 62-66
Page count: 5