Self-supervised Multimodal Emotion Recognition Combining Temporal Attention Mechanism and Unimodal Label Automatic Generation Strategy

Times Cited: 0
Authors
Sun, Qiang [1 ,2 ]
Wang, Shuyu [1 ]
Affiliations
[1] Xian Univ Technol, Sch Automat & Informat Engn, Dept Commun Engn, Xian 710048, Peoples R China
[2] Xian Key Lab Wireless Opt Commun & Network Res, Xian 710048, Peoples R China
Keywords
Multimodal emotion recognition; Self-supervised label generation; Multi-task learning; Temporal attention mechanism; Multimodal fusion; Sentiment analysis; Transformer; Fusion
DOI
10.11999/JEIT231107
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology & Communication Technology]
Discipline Classification Codes
0808; 0809
Abstract
Most multimodal emotion recognition methods aim to find an effective fusion mechanism for features from heterogeneous modalities, so as to learn feature representations with consistent semantics. However, these methods usually ignore the differences in emotional semantics between modalities. To address this problem, a multi-task learning framework is proposed: one multimodal task and three unimodal tasks are trained jointly, so that the emotionally consistent semantic information shared across the multimodal features and the emotionally distinct semantic information contained in each modality are learned respectively. First, to learn the emotionally consistent semantic information, a Temporal Attention Mechanism (TAM) based on a multilayer recurrent neural network is proposed; the contribution of each emotional feature is described by assigning different weights to the time-series feature vectors. For multimodal fusion, fine-grained feature fusion is then carried out per semantic dimension in the semantic space. Second, a self-supervised Unimodal Label Automatic Generation (ULAG) strategy based on the similarity between inter-modal feature vectors is proposed to learn the emotionally distinct semantic information of each modality effectively. Extensive experiments on three datasets (CMU-MOSI, CMU-MOSEI, and CH-SIMS) confirm that the proposed TAM-ULAG model is highly competitive, improving on current benchmark models in both classification metrics (Acc2, F1) and regression metrics (MAE, Corr). For binary classification, the recognition rate is 87.2% and 85.8% on the CMU-MOSI and CMU-MOSEI datasets respectively, and 81.47% on the CH-SIMS dataset. These results show that simultaneously learning the emotionally consistent semantic information and the emotionally distinct semantic information of each modality helps improve the performance of self-supervised multimodal emotion recognition methods.
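The abstract only sketches the two core mechanisms, so the following is a minimal, hypothetical illustration of how they might look in code: a recurrent encoder whose time-step outputs are pooled with learned attention weights (the TAM idea), and a pseudo-label function that rescales the multimodal label by the cosine similarity between a unimodal feature and the fused feature (one possible reading of the ULAG strategy). All class, function, and parameter names below are assumptions introduced for illustration; the paper's actual formulation may differ.

# Minimal sketch (PyTorch); names and formulas are assumptions, not the paper's exact definitions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionEncoder(nn.Module):
    """Multilayer recurrent encoder with additive attention over time steps."""
    def __init__(self, input_dim, hidden_dim, num_layers=2):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim,
                           num_layers=num_layers, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)   # one scalar score per time step

    def forward(self, x):
        # x: (batch, seq_len, input_dim) sequence of frame-level features
        h, _ = self.rnn(x)                       # (batch, seq_len, hidden_dim)
        w = F.softmax(self.score(h), dim=1)      # attention weights over time
        return (w * h).sum(dim=1)                # weighted utterance-level vector

def unimodal_pseudo_label(unimodal_feat, fused_feat, multimodal_label):
    """Rescale the annotated multimodal label by how closely a unimodal feature
    agrees with the fused multimodal feature (cosine-similarity heuristic)."""
    sim = F.cosine_similarity(unimodal_feat, fused_feat, dim=-1)  # (batch,)
    return multimodal_label * sim.detach()       # pseudo-label; no gradient through labels

# Toy usage: encode a batch of 8 acoustic sequences (50 frames, 74-dim features).
audio_encoder = TemporalAttentionEncoder(input_dim=74, hidden_dim=128)
audio_vec = audio_encoder(torch.randn(8, 50, 74))  # -> (8, 128)

In this reading, a separate encoder of this kind would be applied to the text, acoustic, and visual streams, the generated pseudo-labels would supervise the three unimodal tasks, and the human-annotated label would supervise the multimodal task.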
Pages: 588-601
Number of pages: 14