Information Fusion in Attention Networks Using Adaptive and Multi-Level Factorized Bilinear Pooling for Audio-Visual Emotion Recognition

Cited by: 46
Authors
Zhou, Hengshun [1]
Du, Jun [1]
Zhang, Yuanyuan [1]
Wang, Qing [1]
Liu, Qing-Feng [1]
Lee, Chin-Hui [2]
Affiliations
[1] University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei 230027, People's Republic of China
[2] Georgia Institute of Technology, School of Electrical and Computer Engineering, Atlanta, GA 30332, USA
Keywords
Emotion recognition; speech recognition; face recognition; feature extraction; visualization; hidden Markov models; speech processing; factorized bilinear pooling; local response normalization; multi-level and adaptive fusion; attention network; multimodal emotion recognition; Parkinson's disease; neural networks; speech; identification; recurrent; features; model
DOI
10.1109/TASLP.2021.3096037
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Multimodal emotion recognition is a challenging task in emotion computing: human emotions are abstract concepts with many forms of expression, so it is difficult to extract discriminative features that capture their subtle differences. Moreover, how to fully utilize both audio and visual information remains an open problem. In this paper, we propose a novel multimodal fusion attention network for audio-visual emotion recognition based on adaptive and multi-level factorized bilinear pooling (FBP). First, for the audio stream, a fully convolutional network (FCN) equipped with a 1-D attention mechanism and local response normalization is designed for speech emotion recognition. Next, a global FBP (G-FBP) approach is presented to perform audio-visual information fusion by integrating a self-attention-based video stream with the proposed audio stream. To improve G-FBP, an adaptive strategy (AG-FBP) that dynamically computes the fusion weight of the two modalities is devised, based on the emotion-related representation vectors produced by the attention mechanism of each modality. Finally, to fully utilize local emotion information, adaptive and multi-level FBP (AM-FBP) is introduced by combining both global-trunk and intra-trunk data in one recording on top of AG-FBP. Tested on the IEMOCAP corpus for speech emotion recognition with the audio stream alone, the new FCN method outperforms state-of-the-art results with an accuracy of 71.40%. Moreover, evaluated on the AFEW database of the EmotiW2019 sub-challenge and on the IEMOCAP corpus for audio-visual emotion recognition, the proposed AM-FBP approach achieves the best test-set accuracies of 63.09% and 75.49%, respectively.
Pages: 2617-2629
Page count: 13
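
As background for the abstract above: factorized bilinear pooling (FBP) fuses two modality vectors through a low-rank bilinear interaction. Below is a minimal NumPy sketch of the generic FBP recipe (project both vectors with learned low-rank factors, take their element-wise product, sum-pool over windows of size k, then apply signed square-root and L2 normalization). All names, shapes, and toy inputs here are illustrative assumptions, not the authors' implementation, and the adaptive weighting of AG-FBP/AM-FBP is omitted.

import numpy as np

def factorized_bilinear_pooling(x_audio, y_video, U, V, k):
    # Generic FBP sketch (assumed form, not the paper's released code).
    # x_audio: (m,) audio embedding; y_video: (n,) video embedding.
    # U: (m, k*o), V: (n, k*o) learned low-rank projection factors.
    joint = (U.T @ x_audio) * (V.T @ y_video)  # low-rank bilinear term, shape (k*o,)
    z = joint.reshape(-1, k).sum(axis=1)       # sum pooling over windows of size k, shape (o,)
    z = np.sign(z) * np.sqrt(np.abs(z))        # signed square-root (power) normalization
    return z / (np.linalg.norm(z) + 1e-12)     # L2 normalization

# Toy shapes only: 256-d audio vector, 512-d video vector, 128-d fused output.
rng = np.random.default_rng(0)
m, n, o, k = 256, 512, 128, 4
z = factorized_bilinear_pooling(rng.standard_normal(m), rng.standard_normal(n),
                                rng.standard_normal((m, k * o)),
                                rng.standard_normal((n, k * o)), k)
print(z.shape)  # (128,)

In the paper's G-FBP setting, the two inputs would be the attention-pooled audio and video representations; per the abstract, AG-FBP additionally derives a modality fusion weight from those attention-based representation vectors before fusing.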