Enhancing Cross-Modal Alignment in Multimodal Sentiment Analysis via Prompt Learning

Times Cited: 0
Authors
Wang, Xiaofan [1 ]
Li, Xiuhong [1 ]
Li, Zhe [2 ,3 ]
Zhou, Chenyu [1 ]
Chen, Fan [1 ]
Yang, Dan [1 ]
Affiliations
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[2] Hong Kong Polytech Univ, Dept Elect & Elect Engn, Hong Kong, Peoples R China
[3] Stanford Univ, Dept Elect Engn, Stanford, CA 94305 USA
Source
PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024 | 2025, Vol. 15035
Keywords
Prompt learning; Multimodal sentiment analysis; Alignment
DOI
10.1007/978-981-97-8620-6_37
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Multimodal sentiment analysis (MSA) aims to predict the sentiment expressed in paired images and texts. Cross-modal feature alignment is crucial for models to understand the context and extract complementary semantic features. However, most previous MSA approaches show deficiencies in aligning features across modalities. Experimental evidence shows that prompt learning can align features effectively, and prior studies have applied prompt learning to MSA, but only in a unimodal context; applying prompt learning to multimodal feature alignment remains a challenge. This paper presents a multimodal sentiment analysis model based on alignment prompts (MSAPL). The model generates text and image alignment prompts via the Kronecker product, strengthening the engagement of the visual modality and the correlation between image and text data, and thus enabling a better understanding of multimodal inputs. It also employs a multi-layer, stepwise learning approach to acquire textual and image features, progressively modeling the relationships between stage-wise features for richer contextual learning. Experiments on three public datasets demonstrate that our model consistently outperforms all baseline models.
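The record gives no implementation details, but the following minimal PyTorch sketch illustrates one plausible way a Kronecker product of pooled text and image features could be turned into cross-modal alignment prompt tokens, as the abstract describes. All names, dimensions, and the low-rank projections below are illustrative assumptions, not the authors' MSAPL code.

```python
# Illustrative sketch (not the authors' implementation): fusing pooled text
# and image features via a Kronecker product to produce alignment prompts.
import torch
import torch.nn as nn


class AlignmentPromptGenerator(nn.Module):
    def __init__(self, text_dim=768, image_dim=768,
                 prompt_len=4, prompt_dim=768, rank=16):
        super().__init__()
        # Low-rank projections keep the Kronecker product small (rank*rank entries).
        self.text_proj = nn.Linear(text_dim, rank)
        self.image_proj = nn.Linear(image_dim, rank)
        # Map the fused vector to a short sequence of prompt tokens.
        self.to_prompt = nn.Linear(rank * rank, prompt_len * prompt_dim)
        self.prompt_len = prompt_len
        self.prompt_dim = prompt_dim

    def forward(self, text_feat, image_feat):
        # text_feat:  (B, text_dim)  pooled text features (e.g., BERT [CLS])
        # image_feat: (B, image_dim) pooled image features (e.g., ViT [CLS])
        t = self.text_proj(text_feat)    # (B, rank)
        v = self.image_proj(image_feat)  # (B, rank)
        # Kronecker product of two vectors = flattened outer product:
        # every text component interacts with every image component.
        fused = torch.einsum("bi,bj->bij", t, v).flatten(1)  # (B, rank*rank)
        prompts = self.to_prompt(fused)                        # (B, prompt_len*prompt_dim)
        return prompts.view(-1, self.prompt_len, self.prompt_dim)


# Usage: the generated tokens would be prepended to the text encoder's input
# so that both modalities condition the prompt.
gen = AlignmentPromptGenerator()
text_feat, image_feat = torch.randn(2, 768), torch.randn(2, 768)
print(gen(text_feat, image_feat).shape)  # torch.Size([2, 4, 768])
```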
Pages: 541-554
Number of Pages: 14