Building Robust Multimodal Sentiment Recognition via a Simple yet Effective Multimodal Transformer

Cited by: 2
Authors
Zong, Daoming [1 ]
Ding, Chaoyue [1 ]
Li, Baoxiang [1 ]
Zhou, Dinghao [1 ]
Li, Jiakui [1 ]
Zheng, Ken [1 ]
Zhou, Qunyan [1 ]
Affiliations
[1] SenseTime Grp Ltd, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Keywords
Multimodal Sentiment Analysis; Multimodal Fusion; Modality Robustness;
DOI
10.1145/3581783.3612872
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we present our solutions to the MER-MULTI and MER-NOISE sub-challenges of the Multimodal Emotion Recognition Challenge (MER 2023). In both MER-MULTI and MER-NOISE, participants are required to recognize both discrete and dimensional emotions. In MER-NOISE in particular, the test videos are corrupted with noise, so modality robustness must be taken into account. Our empirical findings indicate that the modalities contribute unevenly to these tasks: the audio and visual modalities have a significant impact, while the text modality plays a weaker role in emotion prediction. To facilitate subsequent multimodal fusion, and considering that language information is implicitly embedded in large pre-trained speech models, we deliberately discard the text modality and rely solely on the visual and acoustic modalities for these sub-challenges. To address the potential underfitting of individual modalities during multimodal training, we propose jointly training all modalities via a weighted blending of supervision signals. Furthermore, to enhance the robustness of our model, we apply a range of data augmentation techniques at the image, waveform, and spectrogram levels. Experimental results show that our model ranks 1st in both the MER-MULTI (0.7005) and MER-NOISE (0.6846) sub-challenges, validating the effectiveness of our method. Our code is publicly available at https://github.com/dingchaoyue/MultimodalEmotion-Recognition-MER-and-MuSe-2023-Challenges.
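The weighted blending of supervision signals mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the branch names (`audio`, `visual`, `fused`), the blending weights, and the mean-squared-error objective are all illustrative assumptions.

```python
import numpy as np

def mse(pred, target):
    """Mean-squared-error supervision signal for one branch."""
    return float(np.mean((pred - target) ** 2))

def blended_supervision(preds, target, weights):
    """Blend per-branch losses into a single training objective.

    preds:   dict mapping branch name -> prediction array
    weights: dict mapping branch name -> blending weight
    Supervising each unimodal branch alongside the fused head keeps
    individual modalities from underfitting during joint training.
    """
    return sum(weights[name] * mse(p, target) for name, p in preds.items())

# Toy dimensional-emotion target and per-branch predictions.
target = np.array([0.8, -0.2])
preds = {
    "audio":  np.array([0.6, -0.1]),
    "visual": np.array([0.9,  0.0]),
    "fused":  np.array([0.75, -0.15]),
}
weights = {"audio": 0.25, "visual": 0.25, "fused": 0.5}
loss = blended_supervision(preds, target, weights)
```

In a real training loop each branch loss would be backpropagated through its own head and the shared encoders, with the weights chosen (or scheduled) to balance unimodal and fused supervision.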
Pages: 9596-9600
Page count: 5