Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis

Cited by: 65
Authors
Mai, Sijie [1 ]
Zeng, Ying [1 ]
Zheng, Shuangjia [1 ]
Hu, Haifeng [2 ]
Affiliations
[1] Sun Yat Sen Univ, Guangzhou 510275, Peoples R China
[2] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510275, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Multimodal sentiment analysis; supervised contrastive learning; representation learning; multimodal learning; FUSION; LANGUAGE;
DOI
10.1109/TAFFC.2022.3172360
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The wide application of smart devices enables the availability of multimodal data, which can be utilized in many tasks. In the field of multimodal sentiment analysis, most previous works focus on exploring intra- and inter-modal interactions. However, training a network with cross-modal information (language, audio, and visual) remains challenging due to the modality gap. Besides, while learning dynamics within each sample draws great attention, the learning of inter-sample and inter-class relationships is neglected. Moreover, the limited size of existing datasets restricts the generalization ability of models. To address the aforementioned issues, we propose HyCon, a novel framework for hybrid contrastive learning of tri-modal representations. Specifically, we simultaneously perform intra-/inter-modal contrastive learning and semi-contrastive learning, with which the model can fully explore cross-modal interactions, learn inter-sample and inter-class relationships, and reduce the modality gap. Besides, a refinement term and a modality margin are introduced to enable better learning of unimodal pairs. Moreover, we devise a pair selection mechanism to identify and weight the informative negative and positive pairs. HyCon naturally generates a large number of training pairs for better generalization and reduces the negative effect of limited dataset size. Extensive experiments demonstrate that our method outperforms baselines on multimodal sentiment analysis and emotion recognition.
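To make the inter-modal contrastive idea in the abstract concrete, the sketch below is a minimal, assumed PyTorch implementation of a label-supervised contrastive loss applied across three modality views of each sample. The function name, embedding arguments, and temperature value are illustrative only; the paper's refinement term, modality margin, semi-contrastive objective, and pair-weighting mechanism are deliberately omitted, so this is not HyCon's exact formulation.

# Minimal sketch (not the paper's exact formulation) of an inter-modal,
# label-supervised contrastive loss over tri-modal embeddings.
# Assumes each modality has already been encoded and projected into a
# shared d-dimensional space.
import torch
import torch.nn.functional as F

def inter_modal_contrastive_loss(z_text, z_audio, z_visual, labels, temperature=0.1):
    """z_*: (batch, d) projected unimodal embeddings; labels: (batch,) class ids."""
    # Stack the three modality views: (3 * batch, d), with matching labels.
    z = torch.cat([z_text, z_audio, z_visual], dim=0)
    z = F.normalize(z, dim=-1)
    y = labels.repeat(3)

    # Temperature-scaled cosine similarities; mask out self-similarities.
    sim = z @ z.t() / temperature
    self_mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float('-inf'))

    # Positives: other views or samples that share the same label.
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask

    # Average log-probability over positives for each anchor that has any.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count)
    return loss[pos_mask.any(dim=1)].mean()

Under this construction, each anchor is pulled toward the other modality views of the same sample and toward same-class samples while being pushed away from other classes, which reflects the inter-sample and inter-class relationships the abstract emphasizes.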
Pages: 2276-2289
Page count: 14