Applying Segment-Level Attention on Bi-Modal Transformer Encoder for Audio-Visual Emotion Recognition

Cited by: 15
Authors
Hsu, Jia-Hao [1]
Wu, Chung-Hsien [2]
Affiliations
[1] Natl Cheng Kung Univ, Grad Comp Sci & Informat Engn, Tainan 70101, Taiwan
[2] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan
Keywords
Audio-visual emotion recognition; bi-modal transformer; segment-level attention; SPEECH; HMM;
DOI
10.1109/TAFFC.2023.3258900
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Emotions can be expressed through multiple complementary modalities. This study selected speech and facial expressions as the modalities for recognizing emotions. Current audio-visual emotion recognition models perform supervised learning on signal-level inputs and are presumed to characterize the temporal relationships within the signals. In this study, supervised learning was instead performed on segment-level signals, which are more fine-grained than signal-level inputs, to train an emotion recognition model more precisely. Effectively fusing multimodal signals is challenging. In this study, sequential segments of the audio-visual signals were obtained, features were extracted, and segment-level attention weights were estimated from the emotional consistency of the two modalities using a neural tensor network. The proposed bi-modal Transformer Encoder was trained with both signal-level and segment-level emotion labels, incorporating temporal context into the signals to improve upon existing emotion recognition models. In bi-modal emotion recognition, the experimental results demonstrated that the proposed method achieved 74.31% accuracy (3.05% higher than the method of fusing correlation features) on the audio-visual emotion dataset BAUM-1 under fivefold cross-validation, and 76.81% accuracy (2.57% higher than the Multimodal Transformer Encoder) on the multimodal emotion dataset CMU-MOSEI, which provides predefined training, validation, and testing sets.
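The abstract describes two mechanisms: segment-level attention weights estimated from the emotional consistency of the audio and visual streams by a neural tensor network, and a bi-modal Transformer Encoder trained with both signal-level and segment-level labels. The PyTorch sketch below is only a minimal illustration of that idea under stated assumptions; the module names, feature dimensions, bilinear scoring form, and concatenation-based fusion are hypothetical and are not the authors' implementation.

```python
# Illustrative sketch only: segment-level attention via a neural tensor network (NTN)
# feeding a bi-modal Transformer encoder. Shapes, heads, and fusion scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTensorAttention(nn.Module):
    """Scores per-segment audio-visual consistency and converts the scores
    into segment-level attention weights (hypothetical formulation)."""
    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.W = nn.Parameter(torch.randn(k, dim, dim) * 0.01)  # bilinear tensor slices
        self.V = nn.Linear(2 * dim, k)                          # linear term on [a; v]
        self.u = nn.Linear(k, 1)                                # scalar consistency score

    def forward(self, audio_seg, visual_seg):
        # audio_seg, visual_seg: (batch, n_segments, dim)
        bilinear = torch.einsum('bnd,kde,bne->bnk', audio_seg, self.W, visual_seg)
        linear = self.V(torch.cat([audio_seg, visual_seg], dim=-1))
        score = self.u(torch.tanh(bilinear + linear)).squeeze(-1)  # (batch, n_segments)
        return F.softmax(score, dim=-1)                            # segment-level weights

class BiModalEncoder(nn.Module):
    """Weights fused segment features by the NTN attention, encodes the sequence
    with a Transformer encoder, and emits segment- and signal-level predictions."""
    def __init__(self, dim: int = 128, n_classes: int = 6):
        super().__init__()
        self.attn = NeuralTensorAttention(dim)
        layer = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.segment_head = nn.Linear(2 * dim, n_classes)  # segment-level labels
        self.signal_head = nn.Linear(2 * dim, n_classes)   # signal-level label

    def forward(self, audio_seg, visual_seg):
        w = self.attn(audio_seg, visual_seg)                # (batch, n_segments)
        fused = torch.cat([audio_seg, visual_seg], dim=-1)  # (batch, n_segments, 2*dim)
        fused = fused * w.unsqueeze(-1)                     # apply segment-level attention
        h = self.encoder(fused)                             # temporal context across segments
        return self.segment_head(h), self.signal_head(h.mean(dim=1))

# Toy usage: 4 utterances, 10 segments each, 128-dim features per modality
model = BiModalEncoder()
a, v = torch.randn(4, 10, 128), torch.randn(4, 10, 128)
seg_logits, utt_logits = model(a, v)
print(seg_logits.shape, utt_logits.shape)  # torch.Size([4, 10, 6]) torch.Size([4, 6])
```

In this sketch the two label granularities are served by separate heads, so a joint loss over segment-level and signal-level targets could be formed; how the paper actually combines the losses is not specified in the abstract.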
Pages: 3231-3243
Page count: 13