A Unimodal Reinforced Transformer With Time Squeeze Fusion for Multimodal Sentiment Analysis

Cited by: 21
Authors
He, Jiaxuan [1 ]
Mai, Sijie [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat-sen University, School of Electronics and Information Technology, Guangzhou 510275, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Sparse matrices; Sentiment analysis; Fuses; Convolution; Kernel; Analytical models; Visualization; Time squeeze fusion; unimodal reinforced transformer; multimodal sentiment analysis; EMOTION RECOGNITION
DOI
10.1109/LSP.2021.3078074
Chinese Library Classification
TM [Electrical Technology]; TN [Electronics and Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Multimodal sentiment analysis refers to inferring sentiment from language, acoustic, and visual sequences. Previous studies focus on analyzing aligned sequences, yet unaligned sequences are more common in real-world scenarios. Because unaligned multimodal sequences contain long-range temporal dependencies and provide no time-alignment information, exploring the time-dependent interactions within them is more challenging. To this end, we introduce time squeeze fusion, which automatically explores time-dependent interactions by modeling the unimodal and multimodal sequences from the perspective of compressing the time dimension. Moreover, prior methods tend to fuse unimodal features into a single multimodal embedding from which sentiment is inferred; we argue that unimodal information may be lost in this process, or the generated multimodal embedding may be redundant. To address this issue, we propose a unimodal reinforced Transformer that progressively attends to and distills unimodal information from the multimodal embedding, enabling the embedding to highlight the discriminative unimodal information. Extensive experiments suggest that our model reaches state-of-the-art performance in terms of accuracy and F1 score on the MOSEI dataset.
Pages: 992-996
Page count: 5
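
Illustrative sketch
The abstract's two components admit a straightforward reading: time squeeze fusion compresses the time dimension of an unaligned sequence into a fixed-size embedding, and the unimodal reinforced Transformer lets that embedding attend to, and thereby distill, each unimodal sequence. The minimal PyTorch sketch below illustrates one such reading; it is not the authors' implementation, and the module names, layer choices (a 1-D convolution with adaptive pooling, a single cross-attention block), shapes, and hyperparameters are all assumptions.

import torch
import torch.nn as nn

class TimeSqueezeFusion(nn.Module):
    # Compress an unaligned sequence along time into one embedding;
    # a 1-D convolution plus adaptive pooling is one plausible realization.
    def __init__(self, in_dim, out_dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size, padding=kernel_size // 2)
        self.pool = nn.AdaptiveAvgPool1d(1)  # squeeze the time dimension to length 1

    def forward(self, x):                  # x: (batch, time, in_dim)
        h = self.conv(x.transpose(1, 2))   # (batch, out_dim, time)
        return self.pool(h).squeeze(-1)    # (batch, out_dim)

class UnimodalReinforcedBlock(nn.Module):
    # One cross-attention step: the multimodal embedding (query) attends to
    # a unimodal sequence (keys/values) and so distills unimodal information.
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, multimodal, unimodal):
        # multimodal: (batch, 1, dim); unimodal: (batch, time, dim)
        out, _ = self.attn(multimodal, unimodal, unimodal)
        return self.norm(multimodal + out)  # residual connection + normalization

# Hypothetical usage with assumed feature dimensions:
lang = torch.randn(8, 50, 128)                          # unaligned language features
fused = TimeSqueezeFusion(128, 128)(lang).unsqueeze(1)  # (8, 1, 128)
refined = UnimodalReinforcedBlock(128)(fused, lang)     # (8, 1, 128)

Stacking one such block per modality would correspond to the progressive attention and distillation the abstract describes.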