Multi-level Multi-task representation learning with adaptive fusion for multimodal sentiment analysis

Cited by: 0
Authors
Chuanbo Zhu [1 ]
Min Chen [2 ]
Haomin Li [3 ]
Sheng Zhang [1 ]
Han Liang [1 ]
Chao Sun [1 ]
Yifan Liu [1 ]
Jincai Chen [1 ]
Affiliations
[1] Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Hubei, Wuhan
[2] School of Computer Science and Engineering, South China University of Technology, Guangdong, Guangzhou
[3] Pazhou Laboratory, Guangdong, Guangzhou
[4] School of Computer Science and Technology, Huazhong University of Science and Technology, Hubei, Wuhan
[5] Key Laboratory of Information Storage System, Ministry of Education of China, Hubei, Wuhan
Funding
National Natural Science Foundation of China;
Keywords
Multi-level representation; Multi-task learning; Multimodal adaptive fusion; Multimodal sentiment analysis;
DOI
10.1007/s00521-024-10678-1
Abstract
Multimodal sentiment analysis is an active task in multimodal intelligence that aims to infer a user’s sentiment tendency from multimedia data. Each modality offers a specific and necessary perspective on human sentiment, providing complementary and consensus information that is unavailable from any single modality. Nevertheless, heterogeneous multimedia data often contain inconsistent and conflicting sentiment semantics that limit model performance. In this work, we propose a Multi-level Multi-task Representation Learning with Adaptive Fusion (MuReLAF) network to bridge the semantic gap among different modalities. Specifically, we design a modality adaptive fusion block that adjusts modality contributions dynamically. In addition, we build a multi-level multimodal representation framework that obtains modality-specific and modality-shared semantics through a multi-task learning strategy, where modality-specific semantics carry complementary information and modality-shared semantics carry consensus information. Extensive experiments are conducted on four publicly available datasets: MOSI, MOSEI, SIMS, and SIMSV2(s), demonstrating that our model exhibits superior or comparable performance to state-of-the-art models. The achieved accuracies are 86.28%, 86.07%, 84.46%, and 82.78%, respectively, showcasing improvements of 0.82%, 0.84%, 1.75%, and 1.83%. Further analyses also indicate the effectiveness of our model in sentiment analysis. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
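
As a rough illustration of the adaptive fusion idea described in the abstract, the following PyTorch sketch scores each unimodal representation and mixes them with sample-dependent softmax weights, so that modality contributions are adjusted dynamically. The class name ModalityAdaptiveFusion, the single linear scoring head, and all dimensions are assumptions for illustration only; this is a minimal sketch of the general mechanism, not the paper's MuReLAF block.

import torch
import torch.nn as nn

class ModalityAdaptiveFusion(nn.Module):
    # Illustrative gated fusion: score each unimodal vector, softmax the scores
    # across modalities, and return the weighted sum. Details are assumptions,
    # not the authors' exact implementation.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar score per modality vector

    def forward(self, feats):
        # feats: list of (batch, dim) unimodal features, e.g. [text, audio, vision]
        stacked = torch.stack(feats, dim=1)                   # (batch, M, dim)
        weights = torch.softmax(self.score(stacked), dim=1)   # (batch, M, 1)
        return (weights * stacked).sum(dim=1)                 # (batch, dim) fused

# Hypothetical usage with random text/audio/vision features
t, a, v = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
fused = ModalityAdaptiveFusion(128)([t, a, v])
print(fused.shape)  # torch.Size([8, 128])

Because the weights are computed per sample, a modality whose features are uninformative (or conflicting) for a given input can receive a lower weight, which is the intuition behind dynamically adjusting modality contributions.
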
Pages: 1491-1508
Number of pages: 17