Multi-level Multi-task representation learning with adaptive fusion for multimodal sentiment analysis

Cited by: 0
Authors
Chuanbo Zhu [1 ]
Min Chen [2 ]
Haomin Li [3 ]
Sheng Zhang [1 ]
Han Liang [1 ]
Chao Sun [1 ]
Yifan Liu [1 ]
Jincai Chen [1 ]
Affiliations
[1] Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Hubei, Wuhan
[2] School of Computer Science and Engineering, South China University of Technology, Guangdong, Guangzhou
[3] Pazhou Laboratory, Guangdong, Guangzhou
[4] School of Computer Science and Technology, Huazhong University of Science and Technology, Hubei, Wuhan
[5] Key Laboratory of Information Storage System, Ministry of Education of China, Hubei, Wuhan
Funding
National Natural Science Foundation of China;
Keywords
Multi-level representation; Multi-task learning; Multimodal adaptive fusion; Multimodal sentiment analysis;
DOI
10.1007/s00521-024-10678-1
Abstract
Multimodal sentiment analysis is an active task in multimodal intelligence that aims to infer a user’s sentiment tendency from multimedia data. Each modality offers a specific and necessary perspective on human sentiment, providing complementary and consensus information that is unavailable from any single modality. Nevertheless, heterogeneous multimedia data often contain inconsistent and conflicting sentiment semantics that limit model performance. In this work, we propose a Multi-level Multi-task Representation Learning with Adaptive Fusion (MuReLAF) network to bridge the semantic gap among different modalities. Specifically, we design a modality adaptive fusion block that adjusts modality contributions dynamically. In addition, we build a multi-level multimodal representation framework that obtains modality-specific and modality-shared semantics through a multi-task learning strategy, where modality-specific semantics carry complementary information and modality-shared semantics carry consensus information. Extensive experiments are conducted on four publicly available datasets: MOSI, MOSEI, SIMS, and SIMSV2(s), demonstrating that our model exhibits superior or comparable performance to state-of-the-art models. The achieved accuracies are 86.28%, 86.07%, 84.46%, and 82.78%, respectively, showcasing improvements of 0.82%, 0.84%, 1.75%, and 1.83%. Further analyses also indicate the effectiveness of our model in sentiment analysis. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
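
As a rough illustration of the adaptive fusion idea described in the abstract, the following PyTorch sketch scores each unimodal representation and mixes them with sample-dependent softmax weights, so that modality contributions are adjusted dynamically. The class name ModalityAdaptiveFusion, the single linear scoring head, and all dimensions are assumptions for illustration only; this is a minimal sketch of the general mechanism, not the paper's MuReLAF block.

import torch
import torch.nn as nn

class ModalityAdaptiveFusion(nn.Module):
    # Illustrative gated fusion: score each unimodal vector, softmax the scores
    # across modalities, and return the weighted sum. Details are assumptions,
    # not the authors' exact implementation.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar score per modality vector

    def forward(self, feats):
        # feats: list of (batch, dim) unimodal features, e.g. [text, audio, vision]
        stacked = torch.stack(feats, dim=1)                   # (batch, M, dim)
        weights = torch.softmax(self.score(stacked), dim=1)   # (batch, M, 1)
        return (weights * stacked).sum(dim=1)                 # (batch, dim) fused

# Hypothetical usage with random text/audio/vision features
t, a, v = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
fused = ModalityAdaptiveFusion(128)([t, a, v])
print(fused.shape)  # torch.Size([8, 128])

Because the weights are computed per sample, a modality whose features are uninformative (or conflicting) for a given input can receive a lower weight, which is the intuition behind dynamically adjusting modality contributions.
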
Pages: 1491-1508
Number of pages: 17