Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout

Cited: 0
Authors
Qi, Anbin [1 ]
Liu, Zhongliang [1 ]
Zhou, Xinyong [1 ]
Xiao, Jinba [1 ]
Zhang, Fengrun [1 ]
Gan, Qi [1 ]
Tao, Ming [1 ]
Zhang, Gaozheng [1 ]
Zhang, Lu [1 ]
Affiliations
[1] Shanghai Soulgate Technol Co Ltd, Soul AI, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 2ND INTERNATIONAL WORKSHOP ON MULTIMODAL AND RESPONSIBLE AFFECTIVE COMPUTING, MRAC 2024 | 2024
Keywords
MER 2024; multimodal emotion recognition; fine-tuning CLIP; modality dropout; semi-supervised learning
DOI
10.1145/3689092.3689401
CLC number
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we present our solution for the Second Multimodal Emotion Recognition Challenge Track 1 (MER2024-SEMI). To improve the accuracy and generalization of emotion recognition, we propose several methods for multimodal emotion recognition. First, we introduce EmoVCLIP, a model fine-tuned from CLIP via vision-language prompt learning and designed for video-based emotion recognition. By leveraging prompt learning, EmoVCLIP improves the performance of pre-trained CLIP on emotional videos. Second, to address the issue of modality dependence in multimodal fusion, we employ modality dropout for robust information fusion. Third, to help Baichuan better extract emotional information, we use GPT-4 to provide prompts for Baichuan. Finally, we adopt a self-training strategy to exploit unlabeled videos: videos assigned high-confidence pseudo-labels by our model are added to the training set. Experimental results show that our model ranks 1st in the MER2024-SEMI track, achieving an accuracy of 90.15% on the test set.
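The modality dropout mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration of the general technique, not the authors' implementation; the function name, feature layout, and drop probability are our own assumptions. The idea is to randomly zero out entire modalities during training so the fusion model cannot over-rely on any single one.

```python
import random

def modality_dropout(feats, p_drop=0.3, training=True):
    """Randomly zero out whole modalities before fusion (training only).

    feats: dict mapping a modality name (e.g. "audio", "video", "text")
           to its feature vector, here a plain list of floats.
    p_drop: independent probability of dropping each modality.
    """
    if not training:
        return dict(feats)  # inference: pass all modalities through unchanged
    names = list(feats)
    kept = [n for n in names if random.random() > p_drop]
    if not kept:
        # Guarantee at least one modality survives for every sample,
        # otherwise the fusion head would see all-zero input.
        kept = [random.choice(names)]
    return {n: (v if n in kept else [0.0] * len(v)) for n, v in feats.items()}
```

In a real system the zeroed vectors would be batched tensors and the dropout decision would typically be drawn per sample; the constraint that at least one modality always survives is what keeps the fusion input informative.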
Pages: 49-53
Page count: 5