Transformer Encoder With Multi-Modal Multi-Head Attention for Continuous Affect Recognition

Cited by: 67
Authors
Chen, Haifeng [1 ,2 ]
Jiang, Dongmei [1 ,2 ]
Sahli, Hichem [3 ,4 ]
Affiliations
[1] Northwestern Polytechnical University, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, NPU-VUB Joint AVSP Research Lab, School of Computer Science, Shaanxi Key Laboratory of Speech and Image Information Processing, Xi'an 710072, China
[2] Peng Cheng Laboratory, Shenzhen 518055, China
[3] Vrije Universiteit Brussel, Department of Electronics and Informatics, VUB-NPU Joint AVSP Research Lab, B-1050 Brussels, Belgium
[4] Interuniversity Microelectronics Centre (IMEC), B-3001 Heverlee, Belgium
Keywords
Emotion recognition; Context modeling; Feature extraction; Correlation; Computational modeling; Visualization; Redundancy; Multi-modal affective state recognition; Self-attention; Temporal dependency; Multi-modal multi-head attention; Inter-modality interaction; Networks; Memory
DOI: 10.1109/TMM.2020.3037496
CLC classification: TP [automation technology, computer technology]
Subject classification code: 0812
Abstract
Continuous affect recognition is becoming an increasingly attractive research topic in affective computing. Previous works mainly focused on modelling the temporal dependency within a single sensor modality, or adopted early or late fusion for multi-modal affective state recognition. However, early fusion suffers from the curse of dimensionality, and late fusion ignores the complementarity and redundancy between modal streams. In this paper, we first introduce the Transformer encoder with its self-attention mechanism and propose a Convolutional Neural Network-Transformer Encoder (CNN-TE) framework that models the temporal dependency for single-modal affect recognition. Further, to effectively exploit the complementarity and redundancy between multiple streams, we propose a Transformer Encoder with Multi-modal Multi-head Attention (TEMMA) for multi-modal affect recognition. TEMMA progressively and simultaneously refines the inter-modality interactions and the intra-modality temporal dependencies. The learned multi-modal representations are fed to an inference sub-network with fully connected layers to estimate the affective state. The proposed framework is trained end-to-end and demonstrates its effectiveness on the AVEC2016 and AVEC2019 datasets. Compared to state-of-the-art models, our approach obtains remarkable improvements in both arousal and valence in terms of the concordance correlation coefficient (CCC), reaching 0.583 for arousal and 0.564 for valence on the AVEC2019 test set.
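The abstract names two concrete technical ingredients: attention that alternates between intra-modality temporal dependencies and inter-modality interactions, and evaluation by the concordance correlation coefficient (CCC). The PyTorch sketch below illustrates both under stated assumptions: MultiModalAttentionBlock is only one hypothetical reading of the alternating-attention idea (the layer sizes, normalization placement, and attention ordering are our assumptions, not the authors' exact TEMMA design), while ccc implements the standard CCC formula the abstract's numbers refer to.

```python
# Minimal sketch, assuming PyTorch; not the authors' exact TEMMA architecture.
import torch
import torch.nn as nn


def ccc(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Concordance correlation coefficient:
    # CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    x_m, y_m = x.mean(), y.mean()
    cov = ((x - x_m) * (y - y_m)).mean()
    return 2 * cov / (x.var(unbiased=False) + y.var(unbiased=False) + (x_m - y_m) ** 2)


class MultiModalAttentionBlock(nn.Module):
    """Hypothetical block: self-attention over time within each modality,
    then attention across the modality axis, loosely mirroring the
    'intra-modality temporal dependency' and 'inter-modality interaction'
    refinement described in the abstract."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.modal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, modalities, time, dim)
        b, m, t, d = x.shape
        # Intra-modality: attend over the time axis, one modality at a time.
        h = x.reshape(b * m, t, d)
        h = self.norm1(h + self.temporal_attn(h, h, h)[0])
        # Inter-modality: attend over the modality axis, one time step at a time.
        h = h.reshape(b, m, t, d).permute(0, 2, 1, 3).reshape(b * t, m, d)
        h = self.norm2(h + self.modal_attn(h, h, h)[0])
        return h.reshape(b, t, m, d).permute(0, 2, 1, 3)


if __name__ == "__main__":
    block = MultiModalAttentionBlock(dim=64, num_heads=4)
    x = torch.randn(2, 3, 50, 64)  # batch of 2, 3 modalities, 50 frames, 64-dim features
    print(block(x).shape)          # torch.Size([2, 3, 50, 64]): shape is preserved
    print(ccc(torch.randn(100), torch.randn(100)))
```

Because the block preserves the (batch, modalities, time, dim) shape, several such blocks can be stacked, which is one way to realize the "progressively refine" behaviour the abstract describes before a final inference sub-network pools over modalities and time.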
Pages: 4171-4183
Page count: 13