Transformer-based short-term memory attention for enhanced multimodal sentiment analysis

Times Cited: 1
Authors
Shao, Dangguo [1 ,2 ]
Tang, Kaiqiang [1 ]
Li, Jingtao [1 ]
Yi, Sanli [1 ]
Ma, Lei [1 ]
Affiliations
[1] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, Kunming 650500, Peoples R China
[2] Kunming Univ Sci & Technol, Yunnan Key Lab Artificial Intelligence, Kunming 650500, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Sentiment analysis; Memory attention; Multimodal fusion; Self-distillation; Modal interactions;
DOI
10.1007/s00371-025-03883-z
CLC Number
TP31 [Computer Software];
Subject Classification Codes
081202; 0835;
Abstract
In multimodal sentiment analysis, effectively utilizing and fusing information from multiple modalities remains a challenging task. Most existing studies focus on single-modal information, neglecting the potential of multimodal data. To address this, we propose a Transformer-based short-term memory attention (S-MA) model that captures both intra- and inter-modal interactions, learns the weight distribution across modalities, and enhances modality representations. The model introduces a short-term memory attention module that retains salient features from the previous training iteration, and employs Transformer structures for both intra-modal and inter-modal interactions. Additionally, we introduce a self-distillation method that uses early-stage model outputs as soft labels to guide subsequent training, improving the model's representational capability. Experimental results on three public datasets demonstrate that S-MA outperforms previous state-of-the-art baselines, particularly on MVSA-Single and HFM, where it improves accuracy (ACC) by 1.98 and 1.67 percentage points and F1 by 1.43 and 1.75 percentage points, respectively. The source code and datasets are available at https://github.com/Doyken/S-MA.
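As a rough illustration of the two mechanisms named in the abstract, the PyTorch sketch below pairs an attention layer that reuses features cached from the previous training iteration with a self-distillation loss that mixes hard labels and soft labels from an earlier model snapshot. This is a minimal sketch of one plausible reading, not the paper's implementation; the names ShortTermMemoryAttention, self_distillation_loss, alpha, and tau are hypothetical, as are all hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShortTermMemoryAttention(nn.Module):
    """Attends current fused features against features cached from the
    previous training iteration (an assumed reading of 'short-term memory')."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.memory = None  # feature cache from the previous iteration

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) fused multimodal features.
        # Fall back to self-attention when no compatible memory exists.
        if self.memory is None or self.memory.size(0) != x.size(0):
            mem = x
        else:
            mem = self.memory
        out, _ = self.attn(query=x, key=mem, value=mem)
        x = self.norm(x + out)
        # Detach so the cache does not extend the autograd graph.
        self.memory = x.detach()
        return x


def self_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           labels: torch.Tensor,
                           alpha: float = 0.5,
                           tau: float = 2.0) -> torch.Tensor:
    """Cross-entropy on hard labels plus temperature-scaled KL divergence
    toward soft labels from an earlier snapshot of the same model."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return (1 - alpha) * ce + alpha * kd
```

In use, teacher_logits would come from a frozen copy of the model saved at an earlier training stage, so later epochs are regularized toward the earlier outputs while still fitting the ground-truth labels.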
Pages: 16