Multi-level Multi-task representation learning with adaptive fusion for multimodal sentiment analysis

Cited: 0
Authors
Chuanbo Zhu [1 ]
Min Chen [2 ]
Haomin Li [3 ]
Sheng Zhang [1 ]
Han Liang [1 ]
Chao Sun [1 ]
Yifan Liu [1 ]
Jincai Chen [1 ]
Affiliations
[1] Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Hubei, Wuhan
[2] School of Computer Science and Engineering, South China University of Technology, Guangdong, Guangzhou
[3] Pazhou Laboratory, Guangdong, Guangzhou
[4] School of Computer Science and Technology, Huazhong University of Science and Technology, Hubei, Wuhan
[5] Key Laboratory of Information Storage System, Ministry of Education of China, Hubei, Wuhan
Funding
National Natural Science Foundation of China;
Keywords
Multi-level representation; Multi-task learning; Multimodal adaptive fusion; Multimodal sentiment analysis;
DOI
10.1007/s00521-024-10678-1
Abstract
Multimodal sentiment analysis is an active task in multimodal intelligence that aims to infer a user’s sentiment tendency from multimedia data. Generally, each modality offers a specific and necessary perspective on human sentiment, providing complementary and consensus information unavailable in any single modality. Nevertheless, heterogeneous multimedia data often contain inconsistent and conflicting sentiment semantics that limit model performance. In this work, we propose a Multi-level Multi-task Representation Learning with Adaptive Fusion (MuReLAF) network to bridge the semantic gap among different modalities. Specifically, we design a modality adaptive fusion block to adjust modality contributions dynamically. In addition, we build a multi-level multimodal representation framework that obtains modality-specific and modality-shared semantics through a multi-task learning strategy, where modality-specific semantics carry complementary information and modality-shared semantics carry consensus information. Extensive experiments are conducted on four publicly available datasets: MOSI, MOSEI, SIMS, and SIMSV2(s), demonstrating that our model achieves superior or comparable performance to state-of-the-art models. The achieved accuracies are 86.28%, 86.07%, 84.46%, and 82.78%, respectively, representing improvements of 0.82%, 0.84%, 1.75%, and 1.83%. Further analyses also confirm the effectiveness of our model in sentiment analysis. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
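The "modality adaptive fusion" idea summarized in the abstract can be illustrated with a minimal sketch: a small gate network scores each modality's feature vector, the scores are normalized into contribution weights, and the weighted features are summed. This is not the authors' MuReLAF implementation; the module name `AdaptiveFusion`, the three-modality setup, and the feature dimensions below are assumptions made only for illustration.

```python
# Hypothetical sketch of gated modality fusion (not the paper's code).
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    """Fuse per-modality features with input-dependent contribution weights."""

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        # One scalar gate per modality, computed from that modality's feature vector.
        self.gates = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_modalities)])

    def forward(self, feats):
        # feats: list of (batch, dim) tensors, one per modality (e.g., text, audio, visual).
        scores = torch.cat([gate(f) for gate, f in zip(self.gates, feats)], dim=-1)  # (batch, M)
        weights = torch.softmax(scores, dim=-1)               # dynamic modality contributions
        stacked = torch.stack(feats, dim=1)                   # (batch, M, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (batch, dim)


if __name__ == "__main__":
    # Toy usage with random stand-ins for text, audio, and visual features.
    text, audio, visual = torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128)
    fused = AdaptiveFusion(dim=128)([text, audio, visual])
    print(fused.shape)  # torch.Size([4, 128])
```

The softmax weights here play the role of the dynamically adjusted modality contributions described in the abstract; the paper's actual block, multi-level representations, and multi-task objectives are more involved than this single-step sketch.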
Pages: 1491-1508
Page count: 17