Audio-Guided Fusion Techniques for Multimodal Emotion Analysis

Cited by: 0
Authors
Shi, Pujin [1]
Gao, Fei [1]
Affiliations
[1] Beijing University of Posts and Telecommunications, School of Cyberspace Security, State Key Laboratory of Networking and Switching Technology, Beijing, People's Republic of China
Keywords
Multimodal emotion recognition; Multimodal feature fusion; Self-supervised learning; RECOGNITION; SLEEP
DOI
10.1145/3689092.3689414
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024. First, to improve the performance of the feature extractors on sentiment classification tasks, we fine-tune the video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism, which leverages the robustness of Hubert-large and shows superior effectiveness in fusing both inter-channel and intra-channel information. Third, to improve the accuracy of the model, we iteratively apply self-supervised learning, using high-confidence predictions on unlabeled data as pseudo-labels. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets. We therefore adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, ultimately earning us third place in the MER-SEMI track.
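The pseudo-labeling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_pseudo_labels`, the 0.9 confidence threshold, and the toy probabilities are all assumptions made for illustration. It only shows the general idea of keeping unlabeled samples whose top predicted class probability is high enough and using the predicted class as the label.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Pick unlabeled samples whose top class probability meets the threshold.

    `threshold` is a hypothetical value chosen for this sketch.
    Returns (indices of kept samples, their predicted class labels).
    """
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)      # top class probability per sample
    labels = probs.argmax(axis=1)       # predicted class per sample
    keep = np.flatnonzero(confidence >= threshold)
    return keep, labels[keep]

# Toy class probabilities for four unlabeled clips over three emotion classes.
probs = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.05, 0.92, 0.03],
    [0.10, 0.20, 0.70],
]
indices, labels = select_pseudo_labels(probs, threshold=0.9)
```

In an iterative scheme of the kind the abstract describes, the kept samples would be added to the labeled set, the model retrained, and the selection repeated; the stopping rule and threshold schedule are not given in the source.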
Pages: 62-66
Page count: 5