Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis

Cited by: 65
Authors
Mai, Sijie [1 ]
Zeng, Ying [1 ]
Zheng, Shuangjia [1 ]
Hu, Haifeng [2 ]
Affiliations
[1] Sun Yat Sen Univ, Guangzhou 510275, Peoples R China
[2] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510275, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Multimodal sentiment analysis; supervised contrastive learning; representation learning; multimodal learning; FUSION; LANGUAGE;
DOI
10.1109/TAFFC.2022.3172360
Chinese Library Classification (CLC) code
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The wide application of smart devices enables the availability of multimodal data, which can be utilized in many tasks. In the field of multimodal sentiment analysis, most previous works focus on exploring intra- and inter-modal interactions. However, training a network with cross-modal information (language, audio, and visual) remains challenging due to the modality gap. Besides, while learning dynamics within each sample draws great attention, the learning of inter-sample and inter-class relationships is neglected. Moreover, the limited size of datasets restricts the generalization ability of the models. To address the aforementioned issues, we propose HyCon, a novel framework for hybrid contrastive learning of tri-modal representation. Specifically, we simultaneously perform intra-/inter-modal contrastive learning and semi-contrastive learning, with which the model can fully explore cross-modal interactions, learn inter-sample and inter-class relationships, and reduce the modality gap. Besides, a refinement term and a modality margin are introduced to enable better learning of unimodal pairs. Moreover, we devise a pair selection mechanism to identify and assign weights to informative negative and positive pairs. HyCon can naturally generate many training pairs for better generalization and reduce the negative effect of limited datasets. Extensive experiments demonstrate that our method outperforms baselines on multimodal sentiment analysis and emotion recognition.
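To make the inter-modal, label-supervised part of the abstract concrete, the following is a minimal sketch of a supervised contrastive loss between two unimodal embeddings (e.g., language vs. audio), where cross-modal pairs sharing a sentiment label act as positives. This is an illustrative assumption, not the paper's implementation: the function name, tensor shapes, and temperature are hypothetical, and the refinement term, modality margin, semi-contrastive objective, and pair-selection weighting described in the abstract are not reproduced here.

```python
# Minimal sketch (assumed, not the authors' code): supervised contrastive
# loss between two modality embeddings, with same-label cross-modal pairs
# treated as positives and all other pairs as negatives.
import torch
import torch.nn.functional as F

def inter_modal_supcon_loss(z_a: torch.Tensor,      # (batch, dim) modality-A embeddings
                            z_b: torch.Tensor,      # (batch, dim) modality-B embeddings
                            labels: torch.Tensor,   # (batch,) class labels
                            temperature: float = 0.1) -> torch.Tensor:
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    sim = z_a @ z_b.t() / temperature                            # cross-modal similarity matrix
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()   # same-label pairs are positives
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)   # row-wise log-softmax per anchor
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1.0)
    return loss.mean()
```

In a tri-modal setting along the lines of the abstract, such a term would be computed for the language/audio, language/visual, and audio/visual pairs and combined with intra-modal and semi-contrastive objectives alongside the task loss.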
Pages: 2276 - 2289
Number of pages: 14