A Unimodal Reinforced Transformer With Time Squeeze Fusion for Multimodal Sentiment Analysis

Cited by: 21
Authors
He, Jiaxuan [1 ]
Mai, Sijie [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat-sen University, School of Electronics and Information Technology, Guangzhou 510275, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Sparse matrices; Sentiment analysis; Fuses; Convolution; Kernel; Analytical models; Visualization; Time squeeze fusion; unimodal reinforced transformer; multimodal sentiment analysis; EMOTION RECOGNITION
DOI
10.1109/LSP.2021.3078074
Chinese Library Classification
TM [Electrical Technology]; TN [Electronics and Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Multimodal sentiment analysis refers to inferring sentiment from language, acoustic, and visual sequences. Previous studies focus on analyzing aligned sequences, yet unaligned sequences are more common in real-world scenarios. Because unaligned multimodal sequences contain long-range temporal dependencies and provide no time-alignment information, exploring the time-dependent interactions within them is more challenging. To this end, we introduce time squeeze fusion, which automatically explores time-dependent interactions by modeling the unimodal and multimodal sequences from the perspective of compressing the time dimension. Moreover, prior methods tend to fuse unimodal features into a single multimodal embedding from which sentiment is inferred; we argue that unimodal information may be lost in this process, or the generated multimodal embedding may be redundant. To address this issue, we propose a unimodal reinforced Transformer that progressively attends to and distills unimodal information from the multimodal embedding, enabling the embedding to highlight the discriminative unimodal information. Extensive experiments suggest that our model reaches state-of-the-art performance in terms of accuracy and F1 score on the MOSEI dataset.
Pages: 992-996
Page count: 5
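
Illustrative sketch
The abstract's two components admit a straightforward reading: time squeeze fusion compresses the time dimension of an unaligned sequence into a fixed-size embedding, and the unimodal reinforced Transformer lets that embedding attend to, and thereby distill, each unimodal sequence. The minimal PyTorch sketch below illustrates one such reading; it is not the authors' implementation, and the module names, layer choices (a 1-D convolution with adaptive pooling, a single cross-attention block), shapes, and hyperparameters are all assumptions.

import torch
import torch.nn as nn

class TimeSqueezeFusion(nn.Module):
    # Compress an unaligned sequence along time into one embedding;
    # a 1-D convolution plus adaptive pooling is one plausible realization.
    def __init__(self, in_dim, out_dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size, padding=kernel_size // 2)
        self.pool = nn.AdaptiveAvgPool1d(1)  # squeeze the time dimension to length 1

    def forward(self, x):                  # x: (batch, time, in_dim)
        h = self.conv(x.transpose(1, 2))   # (batch, out_dim, time)
        return self.pool(h).squeeze(-1)    # (batch, out_dim)

class UnimodalReinforcedBlock(nn.Module):
    # One cross-attention step: the multimodal embedding (query) attends to
    # a unimodal sequence (keys/values) and so distills unimodal information.
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, multimodal, unimodal):
        # multimodal: (batch, 1, dim); unimodal: (batch, time, dim)
        out, _ = self.attn(multimodal, unimodal, unimodal)
        return self.norm(multimodal + out)  # residual connection + normalization

# Hypothetical usage with assumed feature dimensions:
lang = torch.randn(8, 50, 128)                          # unaligned language features
fused = TimeSqueezeFusion(128, 128)(lang).unsqueeze(1)  # (8, 1, 128)
refined = UnimodalReinforcedBlock(128)(fused, lang)     # (8, 1, 128)

Stacking one such block per modality would correspond to the progressive attention and distillation the abstract describes.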