Multi-modal Attention for Speech Emotion Recognition

Cited by: 28
Authors
Pan, Zexu [1,2]
Luo, Zhaojie [3]
Yang, Jichen [4]
Li, Haizhou [1,4]
Affiliations
[1] NUS, Inst Data Sci, Singapore, Singapore
[2] NUS, Grad Sch Integrat Sci & Engn, Singapore, Singapore
[3] Osaka Univ, Osaka, Japan
[4] NUS, Dept Elect & Comp Engn, Singapore, Singapore
Source
INTERSPEECH 2020 | 2020
Funding
National Research Foundation, Singapore;
Keywords
speech emotion recognition; multi-modal attention; early fusion; hybrid fusion; sentiment analysis
DOI
10.21437/Interspeech.2020-1653
Abstract
Emotion represents an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as the multi-modal attention network (MMAN), to make use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates attention across the three modalities and selectively fuses the information. cLSTM-MMA is fused with other uni-modal sub-networks in the late fusion stage. The experiments show that speech emotion recognition benefits significantly from visual and textual cues, and that the proposed cLSTM-MMA alone is as competitive as other fusion methods in terms of accuracy, but with a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on the IEMOCAP database for emotion recognition.
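For illustration only, the general idea of attention-based fusion across speech, visual, and textual streams described in the abstract can be sketched as below. This is a minimal PyTorch sketch under assumed settings (40-dim audio filterbanks, 512-dim visual features, 300-dim word embeddings, a generic multi-head attention layer, mean pooling); the class name MultiModalAttentionFusion and all dimensions are hypothetical, and the sketch does not reproduce the authors' cLSTM-MMA or MMAN architecture.

# Illustrative sketch of multi-modal attention fusion (not the authors' implementation).
import torch
import torch.nn as nn

class MultiModalAttentionFusion(nn.Module):
    def __init__(self, dim=128, num_heads=4, num_classes=4):
        super().__init__()
        # One sequence encoder per modality; LSTMs are an assumption here.
        self.audio_enc = nn.LSTM(40, dim, batch_first=True)   # e.g. 40-dim filterbanks
        self.video_enc = nn.LSTM(512, dim, batch_first=True)  # e.g. pre-extracted face features
        self.text_enc = nn.LSTM(300, dim, batch_first=True)   # e.g. word embeddings
        # Cross-modal attention: each modality queries the concatenation of all three.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, audio, video, text):
        a, _ = self.audio_enc(audio)   # (B, Ta, dim)
        v, _ = self.video_enc(video)   # (B, Tv, dim)
        t, _ = self.text_enc(text)     # (B, Tt, dim)
        memory = torch.cat([a, v, t], dim=1)   # all modalities serve as keys/values
        fused = []
        for m in (a, v, t):
            out, _ = self.cross_attn(m, memory, memory)  # attend across modalities
            fused.append(out.mean(dim=1))                # temporal pooling
        return self.classifier(torch.cat(fused, dim=-1))

# Example usage with random tensors standing in for real features.
model = MultiModalAttentionFusion()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 30, 512), torch.randn(2, 20, 300))
print(logits.shape)  # torch.Size([2, 4])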
Pages: 364-368
Number of pages: 5
References (31 in total)
[1]   Solving the emotion paradox: Categorization and the experience of emotion [J].
Barrett, LF .
PERSONALITY AND SOCIAL PSYCHOLOGY REVIEW, 2006, 10 (01) :20-46
[2]   IEMOCAP: interactive emotional dyadic motion capture database [J].
Busso, Carlos ;
Bulut, Murtaza ;
Lee, Chi-Chun ;
Kazemzadeh, Abe ;
Mower, Emily ;
Kim, Samuel ;
Chang, Jeannette N. ;
Lee, Sungbok ;
Narayanan, Shrikanth S. .
LANGUAGE RESOURCES AND EVALUATION, 2008, 42 (04) :335-359
[3]   Benchmarking Multimodal Sentiment Analysis [J].
Cambria, Erik ;
Hazarika, Devamanyu ;
Poria, Soujanya ;
Hussain, Amir ;
Subramanyam, R. B. V. .
COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, CICLING 2017, PT II, 2018, 10762 :166-179
[4]  
Cho JJ, 2019, arXiv, DOI arXiv:1911.00432
[5]  
Eyben F., 2010, P ACM INT C MULT, P1459
[6]   Deep Hierarchical Fusion with application in Sentiment Analysis [J].
Georgiou, Efthymios ;
Papaioannou, Charilaos ;
Potamianos, Alexandros .
INTERSPEECH 2019, 2019, :1646-1650
[7]   Emotion recognition using deep learning approach from audio-visual emotional big data [J].
Hossain, M. Shamim ;
Muhammad, Ghulam .
INFORMATION FUSION, 2019, 49 :69-78
[8]  
Le H, 2019, 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), P5612
[9]   3D Convolutional Neural Networks for Human Action Recognition [J].
Ji, Shuiwang ;
Xu, Wei ;
Yang, Ming ;
Yu, Kai .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2013, 35 (01) :221-231
[10]   Large-scale Video Classification with Convolutional Neural Networks [J].
Karpathy, Andrej ;
Toderici, George ;
Shetty, Sanketh ;
Leung, Thomas ;
Sukthankar, Rahul ;
Fei-Fei, Li .
2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2014, :1725-1732