Multi-modal Attention for Speech Emotion Recognition

Cited by: 28
Authors
Pan, Zexu [1,2]
Luo, Zhaojie [3]
Yang, Jichen [4]
Li, Haizhou [1,4]
Affiliations
[1] NUS, Inst Data Sci, Singapore, Singapore
[2] NUS, Grad Sch Integrat Sci & Engn, Singapore, Singapore
[3] Osaka Univ, Osaka, Japan
[4] NUS, Dept Elect & Comp Engn, Singapore, Singapore
Source
INTERSPEECH 2020 | 2020
Funding
National Research Foundation, Singapore;
Keywords
speech emotion recognition; multi-modal attention; early fusion; hybrid fusion; sentiment analysis
DOI
10.21437/Interspeech.2020-1653
Abstract
Emotion represents an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as the multi-modal attention network (MMAN), to make use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates attention across the three modalities and selectively fuses the information. cLSTM-MMA is fused with other uni-modal sub-networks in the late fusion stage. The experiments show that speech emotion recognition benefits significantly from visual and textual cues, and that the proposed cLSTM-MMA alone is as competitive as other fusion methods in terms of accuracy, but with a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on the IEMOCAP database for emotion recognition.
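For illustration only, the general idea of attention-based fusion across speech, visual, and textual streams described in the abstract can be sketched as below. This is a minimal PyTorch sketch under assumed settings (40-dim audio filterbanks, 512-dim visual features, 300-dim word embeddings, a generic multi-head attention layer, mean pooling); the class name MultiModalAttentionFusion and all dimensions are hypothetical, and the sketch does not reproduce the authors' cLSTM-MMA or MMAN architecture.

# Illustrative sketch of multi-modal attention fusion (not the authors' implementation).
import torch
import torch.nn as nn

class MultiModalAttentionFusion(nn.Module):
    def __init__(self, dim=128, num_heads=4, num_classes=4):
        super().__init__()
        # One sequence encoder per modality; LSTMs are an assumption here.
        self.audio_enc = nn.LSTM(40, dim, batch_first=True)   # e.g. 40-dim filterbanks
        self.video_enc = nn.LSTM(512, dim, batch_first=True)  # e.g. pre-extracted face features
        self.text_enc = nn.LSTM(300, dim, batch_first=True)   # e.g. word embeddings
        # Cross-modal attention: each modality queries the concatenation of all three.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, audio, video, text):
        a, _ = self.audio_enc(audio)   # (B, Ta, dim)
        v, _ = self.video_enc(video)   # (B, Tv, dim)
        t, _ = self.text_enc(text)     # (B, Tt, dim)
        memory = torch.cat([a, v, t], dim=1)   # all modalities serve as keys/values
        fused = []
        for m in (a, v, t):
            out, _ = self.cross_attn(m, memory, memory)  # attend across modalities
            fused.append(out.mean(dim=1))                # temporal pooling
        return self.classifier(torch.cat(fused, dim=-1))

# Example usage with random tensors standing in for real features.
model = MultiModalAttentionFusion()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 30, 512), torch.randn(2, 20, 300))
print(logits.shape)  # torch.Size([2, 4])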
Pages: 364-368
Number of pages: 5
References (31 in total)
[1]   Solving the emotion paradox: Categorization and the experience of emotion [J].
Barrett, LF .
PERSONALITY AND SOCIAL PSYCHOLOGY REVIEW, 2006, 10 (01) :20-46
[2]   IEMOCAP: interactive emotional dyadic motion capture database [J].
Busso, Carlos ;
Bulut, Murtaza ;
Lee, Chi-Chun ;
Kazemzadeh, Abe ;
Mower, Emily ;
Kim, Samuel ;
Chang, Jeannette N. ;
Lee, Sungbok ;
Narayanan, Shrikanth S. .
LANGUAGE RESOURCES AND EVALUATION, 2008, 42 (04) :335-359
[3]   Benchmarking Multimodal Sentiment Analysis [J].
Cambria, Erik ;
Hazarika, Devamanyu ;
Poria, Soujanya ;
Hussain, Amir ;
Subramanyam, R. B. V. .
COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, CICLING 2017, PT II, 2018, 10762 :166-179
[4]  
Cho JJ, 2019, arXiv, DOI arXiv:1911.00432
[5]  
Eyben F., 2010, P ACM INT C MULT, P1459
[6]   Deep Hierarchical Fusion with application in Sentiment Analysis [J].
Georgiou, Efthymios ;
Papaioannou, Charilaos ;
Potamianos, Alexandros .
INTERSPEECH 2019, 2019, :1646-1650
[7]   Emotion recognition using deep learning approach from audio-visual emotional big data [J].
Hossain, M. Shamim ;
Muhammad, Ghulam .
INFORMATION FUSION, 2019, 49 :69-78
[8]  
Le H, 2019, 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), P5612
[9]   3D Convolutional Neural Networks for Human Action Recognition [J].
Ji, Shuiwang ;
Xu, Wei ;
Yang, Ming ;
Yu, Kai .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2013, 35 (01) :221-231
[10]   Large-scale Video Classification with Convolutional Neural Networks [J].
Karpathy, Andrej ;
Toderici, George ;
Shetty, Sanketh ;
Leung, Thomas ;
Sukthankar, Rahul ;
Fei-Fei, Li .
2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2014, :1725-1732