Multi-level Multi-task representation learning with adaptive fusion for multimodal sentiment analysis

Cited: 0
Authors
Chuanbo Zhu [1 ]
Min Chen [2 ]
Haomin Li [3 ]
Sheng Zhang [1 ]
Han Liang [1 ]
Chao Sun [1 ]
Yifan Liu [1 ]
Jincai Chen [1 ]
Affiliations
[1] Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Hubei, Wuhan
[2] School of Computer Science and Engineering, South China University of Technology, Guangdong, Guangzhou
[3] Pazhou Laboratory, Guangdong, Guangzhou
[4] School of Computer Science and Technology, Huazhong University of Science and Technology, Hubei, Wuhan
[5] Key Laboratory of Information Storage System, Ministry of Education of China, Hubei, Wuhan
Funding
National Natural Science Foundation of China;
Keywords
Multi-level representation; Multi-task learning; Multimodal adaptive fusion; Multimodal sentiment analysis;
DOI
10.1007/s00521-024-10678-1
Abstract
Multimodal sentiment analysis is an active task in multimodal intelligence that aims to infer a user’s sentiment tendency from multimedia data. Generally, each modality offers a specific and necessary perspective on human sentiment, providing complementary and consensus information unavailable in any single modality. Nevertheless, heterogeneous multimedia data often contain inconsistent and conflicting sentiment semantics that limit model performance. In this work, we propose a Multi-level Multi-task Representation Learning with Adaptive Fusion (MuReLAF) network to bridge the semantic gap among different modalities. Specifically, we design a modality adaptive fusion block to adjust modality contributions dynamically. In addition, we build a multi-level multimodal representation framework that obtains modality-specific and modality-shared semantics through a multi-task learning strategy, where modality-specific semantics carry complementary information and modality-shared semantics carry consensus information. Extensive experiments are conducted on four publicly available datasets: MOSI, MOSEI, SIMS, and SIMSV2(s), demonstrating that our model achieves superior or comparable performance to state-of-the-art models. The achieved accuracies are 86.28%, 86.07%, 84.46%, and 82.78%, respectively, representing improvements of 0.82%, 0.84%, 1.75%, and 1.83%. Further analyses also confirm the effectiveness of our model in sentiment analysis. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
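The "modality adaptive fusion" idea summarized in the abstract can be illustrated with a minimal sketch: a small gate network scores each modality's feature vector, the scores are normalized into contribution weights, and the weighted features are summed. This is not the authors' MuReLAF implementation; the module name `AdaptiveFusion`, the three-modality setup, and the feature dimensions below are assumptions made only for illustration.

```python
# Hypothetical sketch of gated modality fusion (not the paper's code).
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    """Fuse per-modality features with input-dependent contribution weights."""

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        # One scalar gate per modality, computed from that modality's feature vector.
        self.gates = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_modalities)])

    def forward(self, feats):
        # feats: list of (batch, dim) tensors, one per modality (e.g., text, audio, visual).
        scores = torch.cat([gate(f) for gate, f in zip(self.gates, feats)], dim=-1)  # (batch, M)
        weights = torch.softmax(scores, dim=-1)               # dynamic modality contributions
        stacked = torch.stack(feats, dim=1)                   # (batch, M, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (batch, dim)


if __name__ == "__main__":
    # Toy usage with random stand-ins for text, audio, and visual features.
    text, audio, visual = torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128)
    fused = AdaptiveFusion(dim=128)([text, audio, visual])
    print(fused.shape)  # torch.Size([4, 128])
```

The softmax weights here play the role of the dynamically adjusted modality contributions described in the abstract; the paper's actual block, multi-level representations, and multi-task objectives are more involved than this single-step sketch.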
Pages: 1491-1508
Page count: 17