Applying Segment-Level Attention on Bi-Modal Transformer Encoder for Audio-Visual Emotion Recognition

Cited by: 15
Authors
Hsu, Jia-Hao [1]
Wu, Chung-Hsien [2]
Affiliations
[1] Natl Cheng Kung Univ, Grad Comp Sci & Informat Engn, Tainan 70101, Taiwan
[2] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan
Keywords
Audio-visual emotion recognition; bi-modal transformer; segment-level attention; SPEECH; HMM;
DOI
10.1109/TAFFC.2023.3258900
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Emotions can be expressed through multiple complementary modalities. This study selected speech and facial expressions as the modalities for recognizing emotions. Current audio-visual emotion recognition models perform supervised learning on signal-level inputs and are presumed to characterize the temporal relationships within the signals. In this study, supervised learning was instead performed on segment-level signals, which are more fine-grained than signal-level inputs, to train an emotion recognition model more precisely. Effectively fusing multimodal signals is challenging. In this study, sequential segments of the audio-visual signals were obtained, features were extracted, and segment-level attention weights were estimated from the emotional consistency of the two modalities using a neural tensor network. The proposed bi-modal Transformer Encoder was trained with both signal-level and segment-level emotion labels, incorporating temporal context into the signals to improve upon existing emotion recognition models. In bi-modal emotion recognition, the experimental results demonstrated that the proposed method achieved 74.31% accuracy (3.05% higher than the method of fusing correlation features) on the audio-visual emotion dataset BAUM-1 under fivefold cross-validation, and 76.81% accuracy (2.57% higher than the Multimodal Transformer Encoder) on the multimodal emotion dataset CMU-MOSEI, which provides predefined training, validation, and testing sets.
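The abstract describes two mechanisms: segment-level attention weights estimated from the emotional consistency of the audio and visual streams by a neural tensor network, and a bi-modal Transformer Encoder trained with both signal-level and segment-level labels. The PyTorch sketch below is only a minimal illustration of that idea under stated assumptions; the module names, feature dimensions, bilinear scoring form, and concatenation-based fusion are hypothetical and are not the authors' implementation.

```python
# Illustrative sketch only: segment-level attention via a neural tensor network (NTN)
# feeding a bi-modal Transformer encoder. Shapes, heads, and fusion scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTensorAttention(nn.Module):
    """Scores per-segment audio-visual consistency and converts the scores
    into segment-level attention weights (hypothetical formulation)."""
    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.W = nn.Parameter(torch.randn(k, dim, dim) * 0.01)  # bilinear tensor slices
        self.V = nn.Linear(2 * dim, k)                          # linear term on [a; v]
        self.u = nn.Linear(k, 1)                                # scalar consistency score

    def forward(self, audio_seg, visual_seg):
        # audio_seg, visual_seg: (batch, n_segments, dim)
        bilinear = torch.einsum('bnd,kde,bne->bnk', audio_seg, self.W, visual_seg)
        linear = self.V(torch.cat([audio_seg, visual_seg], dim=-1))
        score = self.u(torch.tanh(bilinear + linear)).squeeze(-1)  # (batch, n_segments)
        return F.softmax(score, dim=-1)                            # segment-level weights

class BiModalEncoder(nn.Module):
    """Weights fused segment features by the NTN attention, encodes the sequence
    with a Transformer encoder, and emits segment- and signal-level predictions."""
    def __init__(self, dim: int = 128, n_classes: int = 6):
        super().__init__()
        self.attn = NeuralTensorAttention(dim)
        layer = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.segment_head = nn.Linear(2 * dim, n_classes)  # segment-level labels
        self.signal_head = nn.Linear(2 * dim, n_classes)   # signal-level label

    def forward(self, audio_seg, visual_seg):
        w = self.attn(audio_seg, visual_seg)                # (batch, n_segments)
        fused = torch.cat([audio_seg, visual_seg], dim=-1)  # (batch, n_segments, 2*dim)
        fused = fused * w.unsqueeze(-1)                     # apply segment-level attention
        h = self.encoder(fused)                             # temporal context across segments
        return self.segment_head(h), self.signal_head(h.mean(dim=1))

# Toy usage: 4 utterances, 10 segments each, 128-dim features per modality
model = BiModalEncoder()
a, v = torch.randn(4, 10, 128), torch.randn(4, 10, 128)
seg_logits, utt_logits = model(a, v)
print(seg_logits.shape, utt_logits.shape)  # torch.Size([4, 10, 6]) torch.Size([4, 6])
```

In this sketch the two label granularities are served by separate heads, so a joint loss over segment-level and signal-level targets could be formed; how the paper actually combines the losses is not specified in the abstract.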
Pages: 3231-3243
Page count: 13