Transformer-based correlation mining network with self-supervised label generation for multimodal sentiment analysis

Times Cited: 2
Authors
Wang, Ruiqing [1 ]
Yang, Qimeng [1 ]
Tian, Shengwei [1 ]
Yu, Long [2 ]
He, Xiaoyu [3 ]
Wang, Bo [1 ]
Affiliations
[1] Xinjiang Univ, Sch Software, Urumqi, Xinjiang, Peoples R China
[2] Xinjiang Univ, Network & Informat Ctr, Xinjiang, Peoples R China
[3] Xinjiang Univ, Coll Informat Sci & Engn, Urumqi 830000, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multimodal sentiment analysis; Transformer; Multimodal fusion; Collaborative learning; FUSION;
DOI
10.1016/j.neucom.2024.129163
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal Sentiment Analysis (MSA), which aims to recognize and understand a speaker's sentiment state by integrating information from natural language, facial expressions, and voice, has gained much attention in recent years. However, modeling multimodal data poses two main challenges: 1) there are potential sentiment correlations both between modalities and within each modality's context, which makes deep sentiment correlation mining and information fusion difficult; 2) sentiment information tends to be unevenly distributed across modalities, making it hard to fully leverage every modality for collaborative learning. To address these challenges, we propose CMLG, a method based on correlation mining and label generation. It uses a Squeeze-and-Excitation Network (SEN) to recalibrate modality features and employs Transformer-based intra-modal and inter-modal feature extractors to mine the intrinsic connections between modalities. In addition, we design a Self-Supervised Label Generation Module (SLGM) that exploits the positive correlation between feature distances and label offsets to generate unimodal labels, and we jointly train the multimodal and unimodal tasks to detect sentiment differences. Extensive experiments on three benchmark datasets (MOSI, MOSEI, and SIMS) show that the proposed CMLG achieves excellent results.
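To illustrate the recalibration step mentioned in the abstract, the following is a minimal PyTorch-style sketch of Squeeze-and-Excitation gating applied to pooled modality features. The class name ModalityRecalibration, the layer widths, and the reduction ratio are illustrative assumptions and do not reproduce the paper's actual implementation.

    # Minimal sketch (not the authors' code): SE-style recalibration of one
    # modality's pooled feature vector. Hyperparameters are assumed.
    import torch
    import torch.nn as nn

    class ModalityRecalibration(nn.Module):
        """Squeeze-and-Excitation gating over the channels of a modality feature."""
        def __init__(self, dim: int, reduction: int = 4):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Linear(dim, dim // reduction),  # squeeze to a bottleneck
                nn.ReLU(inplace=True),
                nn.Linear(dim // reduction, dim),  # excite back to full width
                nn.Sigmoid(),                      # per-channel weights in (0, 1)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, dim) pooled modality features; rescale each channel
            return x * self.gate(x)

    # Toy usage: recalibrate a batch of 8 feature vectors of width 128.
    feats = torch.randn(8, 128)
    out = ModalityRecalibration(128)(feats)
    print(out.shape)  # torch.Size([8, 128])

The sigmoid gate lets the network suppress or emphasize individual feature channels per modality before the Transformer-based intra-modal and inter-modal extractors fuse them.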
Pages: 9