Building Robust Multimodal Sentiment Recognition via a Simple yet Effective Multimodal Transformer

Cited by: 2
Authors
Zong, Daoming [1 ]
Ding, Chaoyue [1 ]
Li, Baoxiang [1 ]
Zhou, Dinghao [1 ]
Li, Jiakui [1 ]
Zheng, Ken [1 ]
Zhou, Qunyan [1 ]
Affiliations
[1] SenseTime Grp Ltd, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Keywords
Multimodal Sentiment Analysis; Multimodal Fusion; Modality Robustness;
DOI
10.1145/3581783.3612872
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we present our solutions to the MER-MULTI and MER-NOISE sub-challenges of the Multimodal Emotion Recognition Challenge (MER 2023). In both MER-MULTI and MER-NOISE, participants are required to recognize both discrete and dimensional emotions. In MER-NOISE in particular, the test videos are corrupted with noise, so modality robustness must be taken into account. Our empirical findings indicate that the modalities contribute unevenly to these tasks: the audio and visual modalities have a significant impact, while the text modality plays a weaker role in emotion prediction. To facilitate subsequent multimodal fusion, and considering that language information is implicitly embedded in large pre-trained speech models, we deliberately discard the text modality and rely solely on the visual and acoustic modalities for these sub-challenges. To address the potential underfitting of individual modalities during multimodal training, we propose jointly training all modalities via a weighted blending of supervision signals. Furthermore, to enhance the robustness of our model, we apply a range of data augmentation techniques at the image, waveform, and spectrogram levels. Experimental results show that our model ranks 1st in both the MER-MULTI (0.7005) and MER-NOISE (0.6846) sub-challenges, validating the effectiveness of our method. Our code is publicly available at https://github.com/dingchaoyue/MultimodalEmotion-Recognition-MER-and-MuSe-2023-Challenges.
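The weighted blending of supervision signals mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the branch names (`audio`, `visual`, `fused`), the blending weights, and the mean-squared-error objective are all illustrative assumptions.

```python
import numpy as np

def mse(pred, target):
    """Mean-squared-error supervision signal for one branch."""
    return float(np.mean((pred - target) ** 2))

def blended_supervision(preds, target, weights):
    """Blend per-branch losses into a single training objective.

    preds:   dict mapping branch name -> prediction array
    weights: dict mapping branch name -> blending weight
    Supervising each unimodal branch alongside the fused head keeps
    individual modalities from underfitting during joint training.
    """
    return sum(weights[name] * mse(p, target) for name, p in preds.items())

# Toy dimensional-emotion target and per-branch predictions.
target = np.array([0.8, -0.2])
preds = {
    "audio":  np.array([0.6, -0.1]),
    "visual": np.array([0.9,  0.0]),
    "fused":  np.array([0.75, -0.15]),
}
weights = {"audio": 0.25, "visual": 0.25, "fused": 0.5}
loss = blended_supervision(preds, target, weights)
```

In a real training loop each branch loss would be backpropagated through its own head and the shared encoders, with the weights chosen (or scheduled) to balance unimodal and fused supervision.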
Pages: 9596-9600
Page count: 5