Audio-Guided Fusion Techniques for Multimodal Emotion Analysis

Cited by: 0
Authors
Shi, Pujin [1 ]
Gao, Fei [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Cyberspace Secur, State Key Lab Networking & Switching Technol, Beijing, Peoples R China
Keywords
Multimodal emotion recognition; Multimodal feature fusion; Self-supervised learning; RECOGNITION; SLEEP;
DOI
10.1145/3689092.3689414
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) of MER2024. First, to enhance the performance of the feature extractors on sentiment classification tasks, we fine-tuned the video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism that leverages the robustness of Hubert-large and shows superior effectiveness in fusing both inter-channel and intra-channel information. Third, to further improve accuracy, we iteratively apply self-supervised learning, treating high-confidence predictions on unlabeled data as pseudo-labels. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets, and therefore adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, ultimately earning us third place in the MER-SEMI track.
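The abstract does not give implementation details of the AGT fusion mechanism, but the core idea of audio-guided fusion can be sketched as cross-attention in which audio features serve as queries over the other modalities. The sketch below is a minimal NumPy illustration under that assumption; the function name, dimensions, and single-head formulation are all hypothetical and do not come from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def audio_guided_fusion(audio, video, text):
    """Hypothetical single-head sketch: audio frames act as queries,
    and each other modality supplies keys/values via scaled
    dot-product attention; results are concatenated with the audio."""
    d_k = audio.shape[-1]
    attended = []
    for modality in (video, text):
        scores = audio @ modality.T / np.sqrt(d_k)   # (T_audio, T_mod)
        weights = softmax(scores, axis=-1)           # rows sum to 1
        attended.append(weights @ modality)          # (T_audio, d)
    # Fuse by concatenating audio with both attended modalities.
    return np.concatenate([audio] + attended, axis=-1)


rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 64))  # 10 audio frames, 64-dim features
video = rng.standard_normal((8, 64))   # 8 video frames
text = rng.standard_normal((5, 64))    # 5 text tokens
out = audio_guided_fusion(audio, video, text)
print(out.shape)  # (10, 192)
```

In a real Transformer-based system, learned query/key/value projections, multiple heads, and layer normalization would replace the raw dot products shown here; this sketch only conveys why the audio channel "guides" the fusion.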
Pages: 62-66
Page count: 5
Related Papers (50 records total)
  • [31] Human Emotion Detection with Electroencephalography Signals and Accuracy Analysis Using Feature Fusion Techniques and a Multimodal Approach for Multiclass Classification
    Kimmatkar, Nisha Vishnnupant
    Babu, B. Vijaya
    ENGINEERING TECHNOLOGY & APPLIED SCIENCE RESEARCH, 2022, 12 (04) : 9012 - 9017
  • [32] Emotion-Aware Multimodal Fusion for Meme Emotion Detection
    Sharma, Shivam
    Ramaneswaran, S.
    Akhtar, Md. Shad
    Chakraborty, Tanmoy
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2024, 15 (03) : 1800 - 1811
  • [33] Exploring Co-Occurence Between Speech and Body Movement for Audio-Guided Video Localization
    Vajaria, Himanshu
    Sarkar, Sudeep
    Kasturi, Rangachar
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2008, 18 (11) : 1608 - 1617
  • [34] APB2FACE: AUDIO-GUIDED FACE REENACTMENT WITH AUXILIARY POSE AND BLINK SIGNALS
    Zhang, Jiangning
    Liu, Liang
    Xue, Zhucun
    Liu, Yong
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4402 - 4406
  • [35] Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning
    Mocanu, Bogdan
    Tapu, Ruxandra
    Zaharia, Titus
    IMAGE AND VISION COMPUTING, 2023, 133
  • [36] Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities
    Middya A.I.
    Nag B.
    Roy S.
    Knowledge-Based Systems, 2022, 244
  • [37] Multimodal Emotion Recognition Based on Feature Fusion
    Xu, Yurui
    Wu, Xiao
    Su, Hang
    Liu, Xiaorui
    2022 INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS (ICARM 2022), 2022, : 7 - 11
  • [38] MULTIMODAL TRANSFORMER FUSION FOR CONTINUOUS EMOTION RECOGNITION
    Huang, Jian
    Tao, Jianhua
    Liu, Bin
    Lian, Zheng
    Niu, Mingyue
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 3507 - 3511
  • [39] Multimodal Transformer Fusion for Emotion Recognition: A Survey
    Belaref, Amdjed
    Seguier, Renaud
    2024 6TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING, ICNLP 2024, 2024, : 107 - 113
  • [40] Fusion with Hierarchical Graphs for Multimodal Emotion Recognition
    Tang, Shuyun
    Luo, Zhaojie
    Nan, Guoshun
    Baba, Jun
    Yoshikawa, Yuichiro
    Ishiguro, Hiroshi
    PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 1288 - 1296