Transformer-based short-term memory attention for enhanced multimodal sentiment analysis

Times Cited: 1
Authors
Shao, Dangguo [1 ,2 ]
Tang, Kaiqiang [1 ]
Li, Jingtao [1 ]
Yi, Sanli [1 ]
Ma, Lei [1 ]
Affiliations
[1] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, Kunming 650500, Peoples R China
[2] Kunming Univ Sci & Technol, Yunnan Key Lab Artificial Intelligence, Kunming 650500, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Sentiment analysis; Memory attention; Multimodal fusion; Self-distillation; Modal interactions;
DOI
10.1007/s00371-025-03883-z
CLC Number
TP31 [Computer Software];
Subject Classification Codes
081202; 0835;
Abstract
In multimodal sentiment analysis, effectively utilizing and fusing information from multiple modalities remains a challenging task. Most existing studies focus on single-modal information, neglecting the potential of multimodal data. To address this, we propose a Transformer-based short-term memory attention (S-MA) model that captures both intra- and inter-modal interactions, learns the weight distribution across modalities, and enhances modality representations. The model introduces a short-term memory attention module that retains salient features from the previous training iteration, and employs Transformer structures for both intra-modal and inter-modal interactions. Additionally, we introduce a self-distillation method that uses early-stage model outputs as soft labels to guide subsequent training, improving the model's representational capability. Experimental results on three public datasets demonstrate that S-MA outperforms previous state-of-the-art baselines, particularly on MVSA-Single and HFM, where it improves accuracy (ACC) by 1.98 and 1.67 percentage points and F1 by 1.43 and 1.75 percentage points, respectively. The source code and datasets are available at https://github.com/Doyken/S-MA.
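As a rough illustration of the two mechanisms named in the abstract, the PyTorch sketch below pairs an attention layer that reuses features cached from the previous training iteration with a self-distillation loss that mixes hard labels and soft labels from an earlier model snapshot. This is a minimal sketch of one plausible reading, not the paper's implementation; the names ShortTermMemoryAttention, self_distillation_loss, alpha, and tau are hypothetical, as are all hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShortTermMemoryAttention(nn.Module):
    """Attends current fused features against features cached from the
    previous training iteration (an assumed reading of 'short-term memory')."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.memory = None  # feature cache from the previous iteration

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) fused multimodal features.
        # Fall back to self-attention when no compatible memory exists.
        if self.memory is None or self.memory.size(0) != x.size(0):
            mem = x
        else:
            mem = self.memory
        out, _ = self.attn(query=x, key=mem, value=mem)
        x = self.norm(x + out)
        # Detach so the cache does not extend the autograd graph.
        self.memory = x.detach()
        return x


def self_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           labels: torch.Tensor,
                           alpha: float = 0.5,
                           tau: float = 2.0) -> torch.Tensor:
    """Cross-entropy on hard labels plus temperature-scaled KL divergence
    toward soft labels from an earlier snapshot of the same model."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return (1 - alpha) * ce + alpha * kd
```

In use, teacher_logits would come from a frozen copy of the model saved at an earlier training stage, so later epochs are regularized toward the earlier outputs while still fitting the ground-truth labels.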
Pages: 16